Commit 9864ee6

Author: 周哲超
Message: docs: change readme
1 parent 1fe2b96, commit 9864ee6

README.md

Lines changed: 115 additions & 3 deletions
@@ -1,9 +1,12 @@
# string-comparison

**JavaScript implementation of [tdebatty/java-string-similarity](https://github.com/tdebatty/java-string-similarity)**

A library implementing different string similarity, distance and sortMatch measures. A dozen algorithms (including Levenshtein edit distance and siblings, Longest Common Subsequence, cosine similarity, etc.) are currently implemented. Check the summary table below for the complete list...

[TOC]

## Download & Usage

download
@@ -41,6 +44,115 @@ The main characteristics of each implemented algorithm are presented below. The
| [Levenshtein](https://github.com/luozhouyang/python-string-similarity/blob/master/README.md#levenshtein) | similarity<br />distance<br />sortMatch | No | Yes | | O(m*n) | |
| [Jaro-Winkler](https://github.com/luozhouyang/python-string-similarity/blob/master/README.md#jaro-winkler) | similarity<br />distance<br />sortMatch | Yes | No | | O(m*n) | typo correction |

## Normalized, metric, similarity and distance
Although the topic might seem simple, many different algorithms exist to measure text similarity or distance. The library therefore defines some interfaces to categorize them.

### (Normalized) similarity and distance

- StringSimilarity: implementing algorithms define a similarity between strings (0 means the strings are completely different).
- NormalizedStringSimilarity: implementing algorithms define a similarity between 0.0 and 1.0, such as Jaro-Winkler.
- StringDistance: implementing algorithms define a distance between strings (0 means the strings are identical), such as Levenshtein. The maximum distance value depends on the algorithm.
- NormalizedStringDistance: this interface extends StringDistance. For implementing classes, the computed distance value is between 0.0 and 1.0. NormalizedLevenshtein is an example of NormalizedStringDistance.

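The relationship between a raw distance, a normalized distance and a normalized similarity can be sketched as follows. This is only an illustration: `positionalDistance`, `normalizedDistance` and `normalizedSimilarity` are hypothetical helpers and are not part of the library's API, and the raw distance used here is a deliberately simple one.

```js
// Hypothetical helpers for illustration only, not the library's API.

// A raw distance: the number of positions at which the two strings differ,
// plus the difference in length (0 means the strings are identical).
function positionalDistance(a, b) {
  const shorter = Math.min(a.length, b.length)
  let diff = Math.abs(a.length - b.length)
  for (let i = 0; i < shorter; i++) {
    if (a[i] !== b[i]) diff++
  }
  return diff
}

// Normalized distance: divide by the largest value the raw distance can take
// (here the length of the longer string), so the result lies in [0.0, 1.0].
function normalizedDistance(a, b) {
  const max = Math.max(a.length, b.length)
  return max === 0 ? 0 : positionalDistance(a, b) / max
}

// Normalized similarity: the complement of the normalized distance.
function normalizedSimilarity(a, b) {
  return 1 - normalizedDistance(a, b)
}

console.log(positionalDistance('healed', 'sealed'))
console.log(normalizedDistance('healed', 'sealed'))
console.log(normalizedSimilarity('healed', 'sealed'))
```

Under these assumptions a raw distance of 0 maps to a similarity of 1.0, and the largest possible raw distance maps to a similarity of 0.0.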
## Levenshtein

The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

It is a metric string distance. This implementation uses dynamic programming (the Wagner–Fischer algorithm) and keeps only 2 rows of data, so the space requirement is O(m) and the running time is O(m*n).

```js
// `Similarity` is the library's export, loaded as described in "Download & Usage"
const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']
let ls = Similarity.levenshtein

console.log(ls.similarity(Thanos, Rival))
console.log(ls.distance(Thanos, Rival))
console.log(ls.sortMatch(Thanos, Avengers))

// output
// 0.8333333333333334
// 1
// [
//   { member: 'edward', index: 0, rating: 0.16666666666666663 },
//   { member: 'theatre', index: 2, rating: 0.4285714285714286 },
//   { member: 'sealed', index: 1, rating: 0.8333333333333334 }
// ]
```
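The two-row dynamic programming approach described above can be sketched as follows. This is a minimal standalone illustration, not the library's source code: `levenshteinDistance` and `levenshteinSimilarity` are hypothetical names, and the normalization by the longer string's length is an assumption chosen because it agrees with the ratings in the example output.

```js
// Hypothetical standalone functions for illustration, not the library's source.
function levenshteinDistance(s, t) {
  if (s === t) return 0
  if (s.length === 0) return t.length
  if (t.length === 0) return s.length

  // prev[j] is the distance between the first i characters of s and the
  // first j characters of t; curr is the row currently being filled in.
  let prev = new Array(t.length + 1)
  let curr = new Array(t.length + 1)
  for (let j = 0; j <= t.length; j++) prev[j] = j

  for (let i = 0; i < s.length; i++) {
    curr[0] = i + 1
    for (let j = 0; j < t.length; j++) {
      const cost = s[i] === t[j] ? 0 : 1
      curr[j + 1] = Math.min(
        curr[j] + 1, // insertion
        prev[j + 1] + 1, // deletion
        prev[j] + cost // substitution, or a free match
      )
    }
    // Swap the rows: the row just computed becomes the previous row.
    const tmp = prev
    prev = curr
    curr = tmp
  }
  return prev[t.length]
}

// Assumed normalization: 1 - distance / length of the longer string.
function levenshteinSimilarity(s, t) {
  const max = Math.max(s.length, t.length)
  return max === 0 ? 1 : 1 - levenshteinDistance(s, t) / max
}

console.log(levenshteinDistance('healed', 'sealed'))
console.log(levenshteinSimilarity('healed', 'sealed'))
```

Only two arrays of length `t.length + 1` are alive at any time, which is where the O(m) space bound comes from.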


## Longest Common Subsequence

The longest common subsequence (LCS) problem consists of finding the longest subsequence common to two (or more) sequences. It differs from the problem of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences.

It is used by the diff utility, by Git for reconciling multiple changes, etc.

The LCS distance between strings X (of length n) and Y (of length m) is n + m - 2 |LCS(X, Y)|. Its minimum is 0 and its maximum is n + m.

LCS distance is equivalent to Levenshtein distance when only insertions and deletions are allowed (no substitution), or when the cost of a substitution is double the cost of an insertion or deletion.

This class implements the dynamic programming approach, which has a space requirement of O(m*n) and a computation cost of O(m*n).

In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time O(log(m)*log(n)). However, the algorithm has a memory requirement of O(m*n²) and was thus not implemented here.

```js
// `Similarity` is the library's export, loaded as described in "Download & Usage"
const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']
let lcs = Similarity.lcs

console.log(lcs.similarity(Thanos, Rival))
console.log(lcs.distance(Thanos, Rival))
console.log(lcs.sortMatch(Thanos, Avengers))

// output
// 0.8333333333333334
// 2
// [
//   { member: 'edward', index: 0, rating: 0.5 },
//   { member: 'theatre', index: 2, rating: 0.6153846153846154 },
//   { member: 'sealed', index: 1, rating: 0.8333333333333334 }
// ]
```
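The dynamic programming approach and the distance formula above can be sketched as follows. This is a standalone illustration rather than the library's code: `lcsLength`, `lcsDistance` and `lcsSimilarity` are hypothetical names, and dividing the distance by its maximum n + m to obtain a similarity is an assumption that happens to agree with the ratings in the example output.

```js
// Hypothetical standalone functions for illustration, not the library's source.
function lcsLength(x, y) {
  const n = x.length
  const m = y.length
  // table[i][j] = length of the LCS of x.slice(0, i) and y.slice(0, j)
  const table = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(0))
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      table[i][j] = x[i - 1] === y[j - 1]
        ? table[i - 1][j - 1] + 1
        : Math.max(table[i - 1][j], table[i][j - 1])
    }
  }
  return table[n][m]
}

// n + m - 2 |LCS(X, Y)|, as defined above.
function lcsDistance(x, y) {
  return x.length + y.length - 2 * lcsLength(x, y)
}

// Assumed normalization: 1 - distance / (n + m).
function lcsSimilarity(x, y) {
  const max = x.length + y.length
  return max === 0 ? 1 : 1 - lcsDistance(x, y) / max
}

console.log(lcsLength('healed', 'sealed'))
console.log(lcsDistance('healed', 'sealed'))
console.log(lcsSimilarity('healed', 'sealed'))
```

The full (n + 1) by (m + 1) table is what gives this version its O(m*n) space requirement.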

## Metric Longest Common Subsequence
A distance metric based on the Longest Common Subsequence, from the notes "An LCS-based string metric" by Daniel Bakkelund.
http://heim.ifi.uio.no/~danielry/StringMetric.pdf

The distance is computed as 1 - |LCS(s1, s2)| / max(|s1|, |s2|)

```js
// `Similarity` is the library's export, loaded as described in "Download & Usage"
const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']
let mlcs = Similarity.mlcs

console.log(mlcs.similarity(Thanos, Rival))
console.log(mlcs.distance(Thanos, Rival))
console.log(mlcs.sortMatch(Thanos, Avengers))

// output
// 0.8333333333333334
// 0.16666666666666663
// [
//   { member: 'edward', index: 0, rating: 0.5 },
//   { member: 'theatre', index: 2, rating: 0.5714285714285714 },
//   { member: 'sealed', index: 1, rating: 0.8333333333333334 }
// ]
```
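The metric LCS formula above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `lcsLength` and `metricLcsDistance` are hypothetical names, and returning 0 for two empty strings is a choice made here to avoid dividing by zero.

```js
// Hypothetical standalone functions for illustration, not the library's source.

// Length of the longest common subsequence, keeping only two rows.
function lcsLength(s1, s2) {
  let prev = new Array(s2.length + 1).fill(0)
  let curr = new Array(s2.length + 1).fill(0)
  for (let i = 1; i <= s1.length; i++) {
    for (let j = 1; j <= s2.length; j++) {
      curr[j] = s1[i - 1] === s2[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1])
    }
    // Swap the rows; index 0 of both rows always stays 0.
    const tmp = prev
    prev = curr
    curr = tmp
  }
  return prev[s2.length]
}

// 1 - |LCS(s1, s2)| / max(|s1|, |s2|)
function metricLcsDistance(s1, s2) {
  const max = Math.max(s1.length, s2.length)
  if (max === 0) return 0 // assumption: two empty strings are identical
  return 1 - lcsLength(s1, s2) / max
}

console.log(metricLcsDistance('healed', 'sealed'))
console.log(metricLcsDistance('healed', 'theatre'))
```

Because the LCS is never longer than the longer string, the result always lies between 0.0 and 1.0.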
## Jaccard index

Like Q-Gram distance, the input strings are first converted into sets of n-grams (sequences of n characters, also called k-shingles), but this time the cardinality of each n-gram is not taken into account. Each input string is simply a set of n-grams. The Jaccard index is then computed as |V1 inter V2| / |V1 union V2|.

Distance is computed as 1 - similarity.
The resulting Jaccard distance is a metric.

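The set-based computation described above can be sketched as follows. This is only an illustration: `ngramSet`, `jaccardSimilarity` and `jaccardDistance` are hypothetical names, the default n-gram size of 2 is an assumption, and the library may preprocess its input differently.

```js
// Hypothetical helpers for illustration, not the library's source.

// Collect the distinct n-grams of a string; repeated n-grams count once.
function ngramSet(text, n) {
  const grams = new Set()
  for (let i = 0; i + n <= text.length; i++) {
    grams.add(text.slice(i, i + n))
  }
  return grams
}

// |V1 inter V2| / |V1 union V2|
function jaccardSimilarity(a, b, n = 2) {
  const v1 = ngramSet(a, n)
  const v2 = ngramSet(b, n)
  if (v1.size === 0 && v2.size === 0) return 1 // assumption: both empty
  let intersection = 0
  for (const gram of v1) {
    if (v2.has(gram)) intersection++
  }
  const union = v1.size + v2.size - intersection
  return intersection / union
}

function jaccardDistance(a, b, n = 2) {
  return 1 - jaccardSimilarity(a, b, n)
}

console.log(jaccardSimilarity('healed', 'sealed'))
console.log(jaccardDistance('healed', 'sealed'))
```

With bigrams, 'healed' and 'sealed' each yield five distinct n-grams and share four of them, so the union holds six.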
## Sorensen-Dice coefficient
Similar to the Jaccard index, but this time the similarity is computed as 2 * |V1 inter V2| / (|V1| + |V2|).

Distance is computed as 1 - similarity.


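The Sorensen-Dice formula can be sketched in the same way. As before this is an illustration, not the library's code: `ngramSet` and `diceSimilarity` are hypothetical names and the default n-gram size of 2 is an assumption.

```js
// Hypothetical helpers for illustration, not the library's source.

// Collect the distinct n-grams of a string.
function ngramSet(text, n) {
  const grams = new Set()
  for (let i = 0; i + n <= text.length; i++) {
    grams.add(text.slice(i, i + n))
  }
  return grams
}

// 2 * |V1 inter V2| / (|V1| + |V2|)
function diceSimilarity(a, b, n = 2) {
  const v1 = ngramSet(a, n)
  const v2 = ngramSet(b, n)
  const total = v1.size + v2.size
  if (total === 0) return 1 // assumption: both sets empty
  let intersection = 0
  for (const gram of v1) {
    if (v2.has(gram)) intersection++
  }
  return (2 * intersection) / total
}

console.log(diceSimilarity('healed', 'sealed'))
console.log(1 - diceSimilarity('healed', 'sealed'))
```

For the same pair of sets the Dice similarity is never lower than the Jaccard index, because shared n-grams are counted twice in the numerator.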
## API
@@ -97,17 +209,17 @@ Return an array of objects. ex:

## Release Notes

-### 1.0 version
+### 1.x version
* Basic building
* Cosine
* DiceCoefficient
* JaccardIndex
* Levenshtein
* LongestCommonSubsequence
* MetricLCS
+* Add function sortMatch()
+

-### 2.0 version
-* none


## MIT
