README.md: 52 additions & 13 deletions
@@ -5,7 +5,31 @@
 
 A library implementing different string similarity, distance and sortMatch measures. A dozen algorithms (including Levenshtein edit distance and siblings, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...
 
-[TOC]
+- [string-comparison](#string-comparison)
+- [Download & Usage](#download--usage)
+- [OverView](#overview)
+- [Normalized, metric, similarity and distance](#normalized-metric-similarity-and-distance)
+- [(Normalized) similarity and distance](#normalized-similarity-and-distance)
+- [Levenshtein](#levenshtein)
+- [Longest Common Subsequence](#longest-common-subsequence)
+- [Metric Longest Common Subsequence](#metric-longest-common-subsequence)
The main characteristics of each implemented algorithm are presented below. The "cost" column estimates the computational cost of computing the similarity between two strings of lengths m and n, respectively.
Although the topic might seem simple, many different algorithms exist for measuring text similarity or distance, so the library defines some interfaces to categorize them.
As with Q-Gram distance, the input strings are first converted into sets of n-grams (sequences of n characters, also called k-shingles), but this time the number of occurrences of each n-gram is not taken into account: each input string is simply a set of n-grams. The Jaccard index is then computed as |V1 inter V2| / |V1 union V2|.
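
As a rough illustration of the Jaccard computation described above, here is a minimal sketch in plain JavaScript. The helper names `ngramSet` and `jaccardSimilarity`, the default n-gram size of 2, and the handling of strings too short to produce any n-gram are assumptions made for this example; they are not taken from the library's source.

```javascript
// Hypothetical helper: split a string into its set of character n-grams
// (k-shingles). Repeated n-grams collapse into one entry, so occurrence
// counts are ignored, as the description above requires.
function ngramSet(text, n = 2) {
  const grams = new Set();
  for (let i = 0; i + n <= text.length; i++) {
    grams.add(text.slice(i, i + n));
  }
  return grams;
}

// Hypothetical helper: Jaccard index |V1 inter V2| / |V1 union V2|.
function jaccardSimilarity(a, b, n = 2) {
  const v1 = ngramSet(a, n);
  const v2 = ngramSet(b, n);
  // Assumption: two inputs with no n-grams at all are treated as identical.
  if (v1.size === 0 && v2.size === 0) return 1;
  let intersection = 0;
  for (const gram of v1) {
    if (v2.has(gram)) intersection++;
  }
  const union = v1.size + v2.size - intersection;
  return intersection / union;
}

// "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}:
// 1 shared bigram out of 7 distinct ones, so the result is 1/7.
console.log(jaccardSimilarity('night', 'nacht'));
```

The corresponding distance would be `1 - jaccardSimilarity(a, b)`.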
@@ -150,45 +178,58 @@ Distance is computed as 1 - similarity.
 Jaccard index is a metric distance.
 
 ## Sorensen-Dice coefficient
+
 Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 inter V2| / (|V1| + |V2|).
 
 Distance is computed as 1 - similarity.
 
-
 ## API
 * `similarity`.
 * `distance`.
 * `sortMatch`
 
-### `similarity`
+### similarity
 
 Implementing algorithms define a similarity between strings
 
-#### Params
+#### params
 
 1. thanos [String]
 2. rival [String]
 
-#### Return
+#### return
 
 Return a similarity between 0.0 and 1.0
 
-### `distance`
+### distance
 
 Implementing algorithms define a distance between strings (0 means strings are identical)
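
To make the Sorensen-Dice formula and the `similarity` / `distance` contract in the README text above more concrete, here is a small sketch in plain JavaScript. The `sorensenDice` object and the `bigramSet` helper are hypothetical stand-ins written for this example, and the choice of bigrams is an assumption; only the parameter names `thanos` and `rival`, the 0.0 to 1.0 similarity range, and the `1 - similarity` distance come from the text. It is not the library's own implementation.

```javascript
// Hypothetical helper: the set of character bigrams of a string.
function bigramSet(text) {
  const grams = new Set();
  for (let i = 0; i + 2 <= text.length; i++) {
    grams.add(text.slice(i, i + 2));
  }
  return grams;
}

// Hypothetical object mirroring the documented shape:
// similarity(thanos, rival) and distance(thanos, rival).
const sorensenDice = {
  // 2 * |V1 inter V2| / (|V1| + |V2|), always between 0.0 and 1.0.
  similarity(thanos, rival) {
    const v1 = bigramSet(thanos);
    const v2 = bigramSet(rival);
    // Assumption: two inputs with no bigrams at all count as identical.
    if (v1.size + v2.size === 0) return 1;
    let shared = 0;
    for (const gram of v1) {
      if (v2.has(gram)) shared++;
    }
    return (2 * shared) / (v1.size + v2.size);
  },
  // Distance is 1 - similarity, so 0 means the strings are identical.
  distance(thanos, rival) {
    return 1 - this.similarity(thanos, rival);
  },
};

console.log(sorensenDice.similarity('night', 'nacht')); // 2 * 1 / (4 + 4) = 0.25
console.log(sorensenDice.distance('night', 'night'));   // 0
```

Keeping `distance` defined in terms of `similarity` means the two values always stay consistent with each other.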