Commit 560d861

added documentation
1 parent 8ba4957 commit 560d861

9 files changed

+46
-18
lines changed

README.md

Lines changed: 6 additions & 2 deletions
@@ -64,14 +64,16 @@ The main characteristics of each implemented algorithm are presented below. The
 | Weighted Levenshtein |distance | No | No | | O(m.n) |
 | Damerau-Levenshtein |distance | No | No | | O(m.n) |
 | Jaro-Winkler |similarity<br>distance | Yes | No | | O(m.n) |
-| Longest Common Subsequence |distance | No | No | | O(m.n) |
+| Longest Common Subsequence |distance | No | No | | O(m.n)* |
 | Metric Longest Common Subsequence |distance | Yes | No | | O(m.n) |
 | N-Gram (Kondrak) |distance | Yes | No | | O(m.n) |
 | Q-Gram |distance | No | No | Profile | O(m+n) |
 | Cosine |similarity<br>distance | Yes | No | Profile | O(m+n) |
 | Jaccard |similarity<br>distance | Yes | Yes | Set | O(m+n) |
 | Sorensen-Dice |similarity<br>distance | Yes | No | Set | O(m+n) |
 
+\* In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time O(log(m).log(n)). However, the algorithm has a memory requirement of O(m.n²) and was therefore not implemented here.
+
 ## Levenshtein
 The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

@@ -233,7 +235,9 @@ max = n + m
 
 LCS distance is equivalent to Levenshtein distance when only insertions and deletions are allowed (no substitution), or when the cost of a substitution is double the cost of an insertion or deletion.
 
-This class currently implements the dynamic programming approach, which has a space requirement O(m.n), and computation cost O (m.n)
+This class implements the dynamic programming approach, which has a space requirement of O(m.n) and a computation cost of O(m.n).
+
+In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time O(log(m).log(n)). However, the algorithm has a memory requirement of O(m.n²) and was therefore not implemented here.
 
 ```java
 import info.debatty.java.stringsimilarity.*;
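
The dynamic programming approach named in the README change above can be sketched as follows. This is an illustration only, not the library's code: the class `LcsSketch` and its method names are hypothetical, and the distance formula `m + n - 2 * |LCS|` is an assumption derived from the statement that LCS distance equals an edit distance restricted to insertions and deletions.

```java
// Hypothetical sketch, not the library's implementation.
final class LcsSketch {

    /** Length of the longest common subsequence, O(m.n) time and space. */
    static int lcsLength(String s1, String s2) {
        int m = s1.length();
        int n = s2.length();
        // table[i][j] = LCS length of the first i chars of s1 and first j chars of s2
        int[][] table = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[m][n];
    }

    /** Assumed definition: number of insertions and deletions needed. */
    static int distance(String s1, String s2) {
        return s1.length() + s2.length() - 2 * lcsLength(s1, s2);
    }

    public static void main(String[] args) {
        System.out.println(distance("AGCAT", "GAC")); // prints 4
    }
}
```

The full `(m + 1) x (n + 1)` table is what gives the O(m.n) space requirement mentioned in the README text.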

src/main/java/info/debatty/java/stringsimilarity/Cosine.java

Lines changed: 3 additions & 0 deletions
@@ -28,6 +28,9 @@
 import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
 
 /**
+ * The similarity between the two strings is the cosine of the angle between
+ * their vector representations. It is computed as V1 . V2 / (|V1| * |V2|).
+ * The cosine distance is computed as 1 - cosine similarity.
  * @author Thibault Debatty
  */
 public class Cosine extends ShingleBased implements
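
To make the formula in the Javadoc above concrete, here is a minimal sketch of cosine similarity over n-gram profiles. It is not the library's implementation: `CosineSketch`, `profile`, and the shingle size parameter `k` are hypothetical names, and the profile is assumed to be a simple map from n-gram to occurrence count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the library's implementation.
final class CosineSketch {

    /** Counts how often each substring of length k occurs in s. */
    static Map<String, Integer> profile(String s, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Euclidean norm |V| of a profile. */
    static double norm(Map<String, Integer> profile) {
        double sum = 0.0;
        for (int count : profile.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** V1 . V2 / (|V1| * |V2|); NaN if either string is shorter than k. */
    static double similarity(String s1, String s2, int k) {
        Map<String, Integer> p1 = profile(s1, k);
        Map<String, Integer> p2 = profile(s2, k);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : p1.entrySet()) {
            dot += entry.getValue() * p2.getOrDefault(entry.getKey(), 0);
        }
        return dot / (norm(p1) * norm(p2));
    }

    static double distance(String s1, String s2, int k) {
        return 1.0 - similarity(s1, s2, k);
    }

    public static void main(String[] args) {
        System.out.println(similarity("ABAB", "BABA", 2));
    }
}
```

With `k = 2`, "ABAB" has the profile {AB: 2, BA: 1} and "BABA" has {BA: 2, AB: 1}, so the dot product is 4 and both norms are √5, giving a similarity of about 0.8.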

src/main/java/info/debatty/java/stringsimilarity/Jaccard.java

Lines changed: 7 additions & 1 deletion
@@ -29,7 +29,13 @@
 import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
 
 /**
- *
+ * Each input string is converted into a set of n-grams; the Jaccard index is
+ * then computed as |V1 inter V2| / |V1 union V2|.
+ * Like Q-Gram distance, the input strings are first converted into sets of
+ * n-grams (sequences of n characters, also called k-shingles), but this time
+ * the cardinality of each n-gram is not taken into account.
+ * Distance is computed as 1 - Jaccard index.
+ * Jaccard distance is a metric.
  * @author Thibault Debatty
  */
 public class Jaccard extends ShingleBased implements
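
The set-based computation described in the Javadoc above can be sketched like this. The class `JaccardSketch`, its method names, and the parameter `k` are hypothetical, and returning 1.0 when both shingle sets are empty is an assumption of this sketch rather than documented library behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the library's implementation.
final class JaccardSketch {

    /** Set of distinct substrings of length k; occurrence counts are dropped. */
    static Set<String> shingles(String s, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            result.add(s.substring(i, i + k));
        }
        return result;
    }

    /** |V1 inter V2| / |V1 union V2| */
    static double similarity(String s1, String s2, int k) {
        Set<String> a = shingles(s1, k);
        Set<String> b = shingles(s2, k);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0; // assumption: two strings without shingles are identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    static double distance(String s1, String s2, int k) {
        return 1.0 - similarity(s1, s2, k);
    }

    public static void main(String[] args) {
        System.out.println(similarity("ABCDE", "ABCDF", 2));
    }
}
```

For "ABCDE" and "ABCDF" with `k = 2`, the sets share AB, BC and CD out of five distinct shingles, so the index is 3/5.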

src/main/java/info/debatty/java/stringsimilarity/JaroWinkler.java

Lines changed: 9 additions & 2 deletions
@@ -5,8 +5,15 @@
 import java.util.Arrays;
 
 /**
- *
- * @author tibo
+ * The Jaro-Winkler distance is designed and best suited for short
+ * strings such as person names, and to detect typos; it is (roughly) a
+ * variation of Damerau-Levenshtein, where the substitution of 2 close
+ * characters is considered less important than the substitution of 2 characters
+ * that are far from each other.
+ * Jaro-Winkler was developed in the area of record linkage (duplicate
+ * detection) (Winkler, 1990). It returns a value in the interval [0.0, 1.0].
+ * The distance is computed as 1 - Jaro-Winkler similarity.
+ * @author Thibault Debatty
  */
 public class JaroWinkler implements NormalizedStringSimilarity, NormalizedStringDistance {
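
As a rough illustration of the measure described in the Javadoc above, the following sketch implements the textbook Jaro-Winkler formulation. It is not taken from the library and may differ from it in detail: the class `JaroWinklerSketch`, the prefix scale of 0.1 and the prefix cap of 4 characters are all assumptions of this sketch.

```java
// Hypothetical sketch of the textbook formulation, not the library's code.
final class JaroWinklerSketch {

    static double jaro(String s1, String s2) {
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 && len2 == 0) {
            return 1.0;
        }
        // Characters only match if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(len2 - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    /** Jaro similarity boosted by the length of the common prefix (max 4). */
    static double similarity(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int prefix = 0;
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }

    static double distance(String s1, String s2) {
        return 1.0 - similarity(s1, s2);
    }

    public static void main(String[] args) {
        System.out.println(similarity("MARTHA", "MARHTA"));
    }
}
```

For "MARTHA" and "MARHTA", all six characters match with one transposition (T and H), so the Jaro similarity is about 0.944, and the three-character common prefix raises the Jaro-Winkler similarity to about 0.961.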

src/main/java/info/debatty/java/stringsimilarity/NormalizedLevenshtein.java

Lines changed: 4 additions & 1 deletion
@@ -28,7 +28,10 @@
 import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
 
 /**
- *
+ * This distance is computed as Levenshtein distance divided by the length of
+ * the longest string. The resulting value is always in the interval [0.0, 1.0]
+ * but it is not a metric anymore!
+ * The similarity is computed as 1 - normalized distance.
  * @author Thibault Debatty
  */
 public class NormalizedLevenshtein implements NormalizedStringDistance, NormalizedStringSimilarity {
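
A small sketch helps show both the normalization described in the Javadoc above and why the result is no longer a metric. The class `NormalizedLevenshteinSketch` and its methods are hypothetical, and treating two empty strings as distance 0.0 is an assumption of this sketch.

```java
// Hypothetical sketch, not the library's implementation.
final class NormalizedLevenshteinSketch {

    /** Classic Levenshtein distance, keeping only two rows of the table. */
    static int levenshtein(String s1, String s2) {
        int[] previous = new int[s2.length() + 1];
        int[] current = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= s1.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[s2.length()];
    }

    /** Levenshtein distance divided by the length of the longest string. */
    static double distance(String s1, String s2) {
        int longest = Math.max(s1.length(), s2.length());
        if (longest == 0) {
            return 0.0; // assumption: two empty strings are identical
        }
        return (double) levenshtein(s1, s2) / longest;
    }

    static double similarity(String s1, String s2) {
        return 1.0 - distance(s1, s2);
    }

    public static void main(String[] args) {
        // Triangle inequality check: direct path vs. detour through "aba".
        double direct = distance("ab", "ba");
        double detour = distance("ab", "aba") + distance("aba", "ba");
        System.out.println(direct > detour); // prints true
    }
}
```

In the example, going from "ab" to "ba" directly costs 2/2 = 1.0, while the detour through "aba" costs 1/3 + 1/3, so the triangle inequality is violated.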

src/main/java/info/debatty/java/stringsimilarity/QGram.java

Lines changed: 6 additions & 1 deletion
@@ -2,9 +2,14 @@
 
 
 import info.debatty.java.stringsimilarity.interfaces.StringDistance;
-import info.debatty.java.utils.SparseIntegerVector;
 
 /**
+ * Q-gram distance, as defined by Ukkonen in "Approximate string-matching with
+ * q-grams and maximal matches". The distance between two strings is defined as
+ * the L1 norm of the difference of their profiles (the number of occurrences of
+ * each n-gram): SUM( |V1_i - V2_i| ). Q-gram distance is a lower bound on
+ * Levenshtein distance, but can be computed in O(m + n), where Levenshtein
+ * requires O(m.n).
  * @author Thibault Debatty
  */
 public class QGram extends ShingleBased implements StringDistance {
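
The L1 norm over profiles described in the Javadoc above can be sketched as follows. This is not the library's implementation: `QGramSketch`, `profile`, and the parameter `k` are hypothetical, and a hash map stands in for whatever profile representation the library uses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the library's implementation.
final class QGramSketch {

    /** Counts how often each substring of length k occurs in s. */
    static Map<String, Integer> profile(String s, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** SUM( |V1_i - V2_i| ) over every n-gram seen in either string. */
    static int distance(String s1, String s2, int k) {
        Map<String, Integer> p1 = profile(s1, k);
        Map<String, Integer> p2 = profile(s2, k);
        Set<String> allGrams = new HashSet<>(p1.keySet());
        allGrams.addAll(p2.keySet());
        int sum = 0;
        for (String gram : allGrams) {
            sum += Math.abs(p1.getOrDefault(gram, 0) - p2.getOrDefault(gram, 0));
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(distance("ABCD", "ABCE", 2)); // prints 2
    }
}
```

With `k = 2`, "ABCD" and "ABCE" differ only in CD versus CE, each of which contributes 1 to the sum.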

src/main/java/info/debatty/java/stringsimilarity/SorensenDice.java

Lines changed: 3 additions & 1 deletion
@@ -28,7 +28,9 @@
 import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
 
 /**
- *
+ * Similar to Jaccard index, but this time the similarity is computed as
+ * 2 * |V1 inter V2| / (|V1| + |V2|).
+ * Distance is computed as 1 - Sorensen-Dice similarity.
  * @author Thibault Debatty
  */
 public class SorensenDice extends ShingleBased implements
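
The formula in the Javadoc above differs from the Jaccard index only in its denominator, which a short sketch makes visible. The class `SorensenDiceSketch`, its method names, and the parameter `k` are hypothetical, and returning 1.0 when both shingle sets are empty is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the library's implementation.
final class SorensenDiceSketch {

    /** Set of distinct substrings of length k. */
    static Set<String> shingles(String s, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            result.add(s.substring(i, i + k));
        }
        return result;
    }

    /** 2 * |V1 inter V2| / (|V1| + |V2|) */
    static double similarity(String s1, String s2, int k) {
        Set<String> a = shingles(s1, k);
        Set<String> b = shingles(s2, k);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumption: two strings without shingles are identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    static double distance(String s1, String s2, int k) {
        return 1.0 - similarity(s1, s2, k);
    }

    public static void main(String[] args) {
        System.out.println(similarity("ABCDE", "ABCDF", 2));
    }
}
```

For "ABCDE" and "ABCDF" with `k = 2`, each set holds four shingles and three are shared, so the similarity is 2 * 3 / 8 = 0.75, compared with 3/5 for the Jaccard index on the same input.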

src/main/java/info/debatty/java/stringsimilarity/StringProfile.java

Lines changed: 4 additions & 7 deletions
@@ -27,7 +27,8 @@
 import info.debatty.java.utils.SparseIntegerVector;
 
 /**
- * Profile of a string, computed using shingling.
+ * Profile of a string (number of occurrences of each shingle/n-gram), computed
+ * using shingling.
  *
  * @author Thibault Debatty
  */
@@ -59,7 +60,7 @@ public StringProfile(SparseIntegerVector vector, KShingling ks) {
 /**
  *
  * @param other
- * @return
+ * @return cosine similarity between this string and the other
  * @throws java.lang.Exception
  */
 public double cosineSimilarity(StringProfile other) throws Exception {
@@ -73,7 +74,7 @@ public double cosineSimilarity(StringProfile other) throws Exception {
 /**
  *
  * @param other
- * @return
+ * @return qgram distance between this string and the other
  * @throws Exception
  */
 public double qgramDistance(StringProfile other) throws Exception {
@@ -113,13 +114,9 @@ public String[] getMostFrequentNGrams(int number) {
 smallest_frequency = frequencies[j];
 }
 }
-
 }
 
 }
-
 return strings;
-
-
 }
 }

src/main/java/info/debatty/java/stringsimilarity/StringSet.java

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,7 @@
 /*
  * The MIT License
  *
- * Copyright 2015 tibo.
+ * Copyright 2015 Thibault Debatty.
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
  * of this software and associated documentation files (the "Software"), to deal
@@ -27,8 +27,9 @@
 import info.debatty.java.utils.SparseBooleanVector;
 
 /**
- *
- * @author tibo
+ * Set representation of a string (list of occurring shingles/n-grams), without
+ * cardinality.
+ * @author Thibault Debatty
  */
 public class StringSet {
 private final SparseBooleanVector vector;
