Mercurial
DevDirective » Articles » DiffSeries The code and article markdown for my Diff article series on DevDirective
Clone URL:  
Pushed to one repository · View In Graph Contained in tip

Initial commit of part 2 of the series.

Changeset 4ce4a12f9920

Parent 4cb62c0f12ee

by Profile picture of Lasse V. KarlsenLasse V. Karlsen

Changes to 2 files · Browse files at 4ce4a12f9920 Showing diff from parent 4cb62c0f12ee Diff from another changeset...

Change 1 of 1 Show Entire File Part 2/​Diff article series, part 2.md
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
 
@@ -0,0 +1,256 @@
+Creating a reusable, though simple, diff-implementation in C#, part 2 +=== + +> This is part 2 of an article series where I will walk you through creating a simple and reusable diff-algorithm for your .NET projects. + +Previously, on DevDirective... +-- + +If you haven't already read [part 1][1], I strongly suggest you do so now and come back here afterwards. It will make this part, and subsequent ones, much easier to understand when you have all the context. + +Repository +-- + +I've made the article markdown texts, as well as the code for these articles, available on my Kiln account. + +You can download a [LINQPad][3] program that contains all the code in this article, in executable and testable form, so if you want to avoid having to type it all out, go download [LINQPad][3] (it's free, and *excellent!*) and the program and experiment! + +[The source code can be downloaded here][4]. Note that while the Kiln website is a Mercurial front-end, you do not need Mercurial in order to download the code. Simply + +Overview +-- + +In this part I'm going to implement a very basic version of the [Longest Common Substring][2] algorithm. In a later part of this article series I will revisit this implementation in order to improve it, but to begin with, let's make it simple so that it is easy to explain, write, test, and not least, *understand*. + +In the [first article][1], I used the following two pieces of text: + +<pre> +This long piece of text will have a common part found by LCS. +This extra long piece of text will have some common parts found by LCS. +</pre> + +I'm going to reuse those for this part and subsequent ones, so that we can see that the final outcome indeed matches the one I predicted in the first article in the series. + +Also remember that I mentioned that the code I will write to implement the Diff algorithm should handle *any* type of objects. Therefore, though I will be using texts, and thus the `string` type in .NET for the examples, the code will be written using generics. + +Goal +-- + +I will implement a LongestCommonSubstring method that can: + +1. Take two collections of type T, where T is any type +2. Take a `IEqualityComparer<T>` object, that will be used to determine if two T's are the same +3. Return the result, indicating whether the method was able to find a longest-common-substring or not, and if it did, where it was found in the two collections and how long it is + +Return values +-- + +For the 3rd goal there I will create a new type, so let's start with that. + +I will need to create a type that: + +1. Contains a `Success` property, of type `bool` that indicates whether we found a LCS or not +2. In case of success, the position of the LCS in the first collection that was passed to the method +3. In case of success, same for the second collection +4. In case of success, the length of the LCS + +In the case of "not a success", the three fields for the positions and the length are undefined but we'll ensure these have the value 0 in the current implementation. + +Here's how I see this type: + + public struct LongestCommonSubstringResult + { + private readonly bool _Success; + private readonly int _PositionInFirstCollection; + private readonly int _PositionInSecondCollection; + private readonly int _Length; + + public LongestCommonSubstringResult( + int positionInFirstCollection, + int positionInSecondCollection, + int length) + { + _Success = true; + _PositionInFirstCollection = positionInFirstCollection; + _PositionInSecondCollection = positionInSecondCollection; + _Length = length; + } + + public bool Success + { + get + { + return _Success; + } + } + + public int PositionInFirstCollection + { + get + { + return _PositionInFirstCollection; + } + } + + public int PositionInSecondCollection + { + get + { + return _PositionInSecondCollection; + } + } + + public int Length + { + get + { + return _Length; + } + } + + public override string ToString() + { + if (Success) + return string.Format( + "LCS ({0}, {1} x{2})", + PositionInFirstCollection, + PositionInSecondCollection, + Length); + else + return "LCS (-)"; + } + } + } + +I made this a `struct` since we will potentially have quite a few of these, and they're rather small (3 ints and a bool), so I didn't want to build up any pressure on the garbage collector. The type is also immutable. + +Since it is a `struct` there will be an implicit public parameterless constructor that initializes the whole `struct` to zeroes, which means that the length and positions will all be zero, and the Success will be false. Perfect. + +Thus, if the method that implements the longest common substring fails to find a common substring, here's what it would return: + + return new LongestCommonSubstringResult(); + +and if it *does* find a common substring, here's what it would return instead: + + return new LongestCommonSubstringResult(position1, position2, length); + +Longest common substring method +-- + +Now on to the method that our Diff implementation needs, the one that implements the [longest common substring][1] algorithm. + +As I said earlier, this will be a generic method, with a naive and simple implementation. It will be simple to understand, but it will perform rather horribly. Still, for this article series, at least until a bit later in the series, this is more than enough. + +This is how I see the method as being defined: + + public LongestCommonSubstringResult FindLongestCommonSubstring<T>( + IList<T> firstCollection, + IList<T> secondCollection, + IEqualityComparer<T> equalityComparer) + +Since it is imperative that we can navigate a bit back and forth in this implementation, I made it clear right in the parameter list that I need something that is indexable, otherwise I would've declared the two collections as `IEnumerable<T>` instead. This has the unfortunate consequence, however, that we cannot call the method passing in two strings, as the `string` type does not implement `IList<T>`. + +I will create a couple of overloads for the method, however, to make it simpler to call for the common cases: + + public LongestCommonSubstringResult FindLongestCommonSubstring( + string firstText, + string secondText) + { + return FindLongestCommonSubstring<char>( + firstText.ToCharArray(), + secondText.ToCharArray()); + } + + public LongestCommonSubstringResult FindLongestCommonSubstring<T>( + IList<T> firstCollection, + IList<T> secondCollection) + { + return FindLongestCommonSubstring( + firstCollection, + secondCollection, + EqualityComparer<T>.Default); + } + +With these two overloads we can easily compare strings, and for the generic version we don't have to pass in the `IEqualityComparer<T>` object, as long as we can make do with the default implementation for the type `T` that is being used. + +With the above code in mind, let's write our super-naive and slow implementation of the [longest common substring][2]. + +Basically, I will loop through the first collection, and for each position in the first collection, I will loop through the entire second collection. Every time I find an element in the two collections that compare equal, I will count how many consecutive elements compare equal, and keep the longest one found. + +If we manage to get to the end of the loop without having found any equal elements, there is no common substring between the two collections. + +The performance characteristics of this implementation is rather bad. At the very least we will do N*M comparisons, where N and M are the number of elements in the two respective collections. At the most, in a degenerative case (two collections consisting only of elements that are equal, ie. the same element repeated over and over again), we can end up doing N*(N+1)/2*M comparisons. Later in this article series I will speed up this somewhat, but this is an excellent place for some optimizations. I will leave that as an exercise for you, the reader, for the moment. + +Here's the code. I broke the code that counts the equal elements out into its own method to make the code easier to read: + + public LongestCommonSubstringResult FindLongestCommonSubstring<T>( + IList<T> firstCollection, + IList<T> secondCollection, + IEqualityComparer<T> equalityComparer) + { + // default result, if we can't find anything + var result = new LongestCommonSubstringResult(); + + for (int index1 = 0; index1 < firstCollection.Count; index1++) + { + for (int index2 = 0; index2 < secondCollection.Count; index2++) + { + if (equalityComparer.Equals(firstCollection[index1], secondCollection[index2])) + { + int length = CountEqual( + firstCollection, index1, + secondCollection, index2, + equalityComparer); + + // Is this longer than what we already have? Yes --> record new LCS + if (length > result.Length) + { + result = new LongestCommonSubstringResult(index1, index2, length); + } + } + } + } + + return result; + } + + public int CountEqual<T>( + IList<T> firstCollection, int firstPosition, + IList<T> secondCollection, int secondPosition, + IEqualityComparer<T> equalityComparer) + { + int length = 0; + while (firstPosition < firstCollection.Count && secondPosition < secondCollection.Count) + { + if (!equalityComparer.Equals( + firstCollection[firstPosition], + secondCollection[secondPosition])) + { + break; + } + + firstPosition++; + secondPosition++; + length++; + } + return length; + } + +Now, let's test this code. + + string text1 = "This long piece of text will have a common part found by LCS."; + string text2 = "This extra long piece of text will have some common parts found by LCS."; + + var result = FindLongestCommonSubstring(text1, text2); + +The contents of result should now correspond to this: + + Success = true + PositionInFirstCollection = 4 + PositionInSecondCollection = 10 + Length = 30 + + [1]: http://devdirective.com/post/91/creating-a-reusable-though-simple-diff-implementation-in-csharp-part-1 + [2]: http://en.wikipedia.org/wiki/Longest_common_substring_problem + [3]: http://linqpad.net + [4]: https://lassevk.kilnhg.com/Repo/DevDirective/DiffSeries/Repository \ No newline at end of file
Change 1 of 1 Show Entire File Part 2/​LongestCommonSubstring.linq
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
@@ -0,0 +1,145 @@
+<Query Kind="Program" /> + +void Main() +{ + string text1 = "This long piece of text will have a common part found by LCS."; + string text2 = "This extra long piece of text will have some common parts found by LCS."; + + var result = FindLongestCommonSubstring(text1, text2); + result.Dump(); +} + +public LongestCommonSubstringResult FindLongestCommonSubstring( + string firstText, + string secondText) +{ + return FindLongestCommonSubstring<char>( + firstText.ToCharArray(), + secondText.ToCharArray()); +} + +public LongestCommonSubstringResult FindLongestCommonSubstring<T>( + IList<T> firstCollection, + IList<T> secondCollection) +{ + return FindLongestCommonSubstring( + firstCollection, + secondCollection, + EqualityComparer<T>.Default); +} + +public LongestCommonSubstringResult FindLongestCommonSubstring<T>( + IList<T> firstCollection, + IList<T> secondCollection, + IEqualityComparer<T> equalityComparer) +{ + // default result, if we can't find anything + var result = new LongestCommonSubstringResult(); + + for (int index1 = 0; index1 < firstCollection.Count; index1++) + { + for (int index2 = 0; index2 < secondCollection.Count; index2++) + { + if (equalityComparer.Equals(firstCollection[index1], secondCollection[index2])) + { + int length = CountEqual( + firstCollection, index1, + secondCollection, index2, + equalityComparer); + + // Is this longer than what we already have? Yes --> record new LCS + if (length > result.Length) + { + result = new LongestCommonSubstringResult(index1, index2, length); + } + } + } + } + + return result; +} + +public int CountEqual<T>( + IList<T> firstCollection, int firstPosition, + IList<T> secondCollection, int secondPosition, + IEqualityComparer<T> equalityComparer) +{ + int length = 0; + while (firstPosition < firstCollection.Count && secondPosition < secondCollection.Count) + { + if (!equalityComparer.Equals( + firstCollection[firstPosition], + secondCollection[secondPosition])) + { + break; + } + + firstPosition++; + secondPosition++; + length++; + } + return length; +} + +public struct LongestCommonSubstringResult +{ + private readonly bool _Success; + private readonly int _PositionInFirstCollection; + private readonly int _PositionInSecondCollection; + private readonly int _Length; + + public LongestCommonSubstringResult( + int positionInFirstCollection, + int positionInSecondCollection, + int length) + { + _Success = true; + _PositionInFirstCollection = positionInFirstCollection; + _PositionInSecondCollection = positionInSecondCollection; + _Length = length; + } + + public bool Success + { + get + { + return _Success; + } + } + + public int PositionInFirstCollection + { + get + { + return _PositionInFirstCollection; + } + } + + public int PositionInSecondCollection + { + get + { + return _PositionInSecondCollection; + } + } + + public int Length + { + get + { + return _Length; + } + } + + public override string ToString() + { + if (Success) + return string.Format( + "LCS ({0}, {1} x{2})", + PositionInFirstCollection, + PositionInSecondCollection, + Length); + else + return "LCS (-)"; + } +}