This repository was archived by the owner on Sep 29, 2023. It is now read-only.

Commit e1e9b35

author
Jonas Chapuis
committed
added information about FuzzyParsers in README.md
1 parent 247b3c5 commit e1e9b35

1 file changed

+94
-0
lines changed


README.md

Lines changed: 94 additions & 0 deletions
@@ -230,3 +230,97 @@ Completing on the same expressions (`2`, `2+`, `(10*2`) now leads to the followi
|`%?` |defines the description of the completions tag|<code>("+" &#124; "-") %? "arithmetic operators"</code>|
|`%%` |defines the completions tag kind (can be used to encode properties for the tag, e.g. visual decorations)|<code>("+" &#124; "-") %% "style: highlight"</code>|
|`%-%` |defines the completion kind (can be used to encode properties for each completion entry, e.g. visual decorations)|<code>("+" %-% "style: highlight" &#124; "-")</code>|

## Fuzzy completion

This library also provides parsers that support fuzzy completion. They live in the `FuzzyParsers` trait and are built with the `oneOfTerms` method, which completes the input against a set of terms even when the input only approximately matches them. Note that parsing itself still requires an exact match; it is fast thanks to a prefix trie lookup on each input character. For instance, with the following dummy grammar:

```scala
object Grammar extends FuzzyParsers {
val fuzzyCountries = "my favourite country is " ~ oneOfTerms(Seq("United States of America", "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Antigua & Deps", "Argentina", "Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas", "Bahrain", "Bangladesh", "Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia", "Bosnia Herzegovina", "Botswana", "Brazil", "Brunei", "Bulgaria", "Burkina", "Burma", "Burundi", "Cambodia", "Cameroon", "Canada", "Cape Verde", "Central African Rep", "Chad", "Chile", "People's Republic of China", "Republic of China", "Colombia", "Comoros", "Democratic Republic of the Congo", "Republic of the Congo", "Costa Rica,", "Croatia", "Cuba", "Cyprus", "Czech Republic", "Danzig", "Denmark", "Djibouti", "Dominica", "Dominican Republic", "East Timor", "Ecuador", "Egypt", "El Salvador", "Equatorial Guinea", "Eritrea", "Estonia", "Ethiopia", "Fiji", "Finland", "France", "Gabon", "Gaza Strip", "The Gambia", "Georgia", "Germany", "Ghana", "Greece", "Grenada", "Guatemala", "Guinea", "Guinea-Bissau", "Guyana", "Haiti", "Holy Roman Empire", "Honduras", "Hungary", "Iceland", "India", "Indonesia", "Iran", "Iraq", "Republic of Ireland", "Israel", "Italy", "Ivory Coast", "Jamaica", "Japan", "Jonathanland", "Jordan", "Kazakhstan", "Kenya", "Kiribati", "North Korea", "South Korea", "Kosovo", "Kuwait", "Kyrgyzstan", "Laos", "Latvia", "Lebanon", "Lesotho", "Liberia", "Libya", "Liechtenstein", "Lithuania", "Luxembourg", "Macedonia", "Madagascar", "Malawi", "Malaysia", "Maldives", "Mali", "Malta", "Marshall Islands", "Mauritania", "Mauritius", "Mexico", "Micronesia", "Moldova", "Monaco", "Mongolia", "Montenegro", "Morocco", "Mount Athos", "Mozambique", "Namibia", "Nauru", "Nepal", "Newfoundland", "Netherlands", "New Zealand", "Nicaragua", "Niger", "Nigeria", "Norway", "Oman", "Ottoman Empire", "Pakistan", "Palau", "Panama", "Papua New Guinea", "Paraguay", "Peru", "Philippines", "Poland", "Portugal", "Prussia", "Qatar", "Romania", "Rome", 
"Russian Federation", "Rwanda", "St Kitts & Nevis", "St Lucia", "Saint Vincent & the", "Grenadines", "Samoa", "San Marino", "Sao Tome & Principe", "Saudi Arabia", "Senegal", "Serbia", "Seychelles", "Sierra Leone", "Singapore", "Slovakia", "Slovenia", "Solomon Islands", "Somalia", "South Africa", "Spain", "Sri Lanka", "Sudan", "Suriname", "Swaziland", "Sweden", "Switzerland", "Syria", "Tajikistan", "Tanzania", "Thailand", "Togo", "Tonga", "Trinidad & Tobago", "Tunisia", "Turkey", "Turkmenistan", "Tuvalu", "Uganda", "Ukraine", "United Arab Emirates", "United Kingdom", "Uruguay", "Uzbekistan", "Vanuatu", "Vatican City", "Venezuela", "Vietnam", "Yemen", "Zambia", "Zimbabwe"))
}
```

Performing the following completion:

```scala
Grammar.completeString(Grammar.fuzzyCountries, "my favourite country is Swtlz")
```

leads to this output:

```scala
List(Sweden, Swaziland, Switzerland)
```

`oneOfTerms` stores the similarity value as the score of each completion entry, so that completions can be ordered by relevance:

```scala
Grammar.complete(Grammar.fuzzyCountries, "my favourite country is Thld")
```

leads to:

```json
{
  "position": {
    "line": 1,
    "column": 25
  },
  "sets": [
    {
      "tag": {
        "label": "",
        "score": 0
      },
      "completions": [
        {
          "value": "Thailand",
          "score": 43
        },
        {
          "value": "The Gambia",
          "score": 25
        },
        {
          "value": "Jonathanland",
          "score": 22
        },
        {
          "value": "Chad",
          "score": 20
        },
        {
          "value": "Togo",
          "score": 20
        }
      ]
    }
  ]
}
```

### `oneOfTerms` parameters

Below is the signature of the `oneOfTerms` method:

```scala
def oneOfTerms(terms: Seq[String],
               similarityMeasure: (String, String) => Double = diceSorensenSimilarity,
               similarityThreshold: Int = DefaultSimilarityThreshold,
               maxCompletionsCount: Int = DefaultMaxCompletionsCount)
```

- `terms`: the list of terms to build the parser for
- `similarityMeasure`: the string similarity metric to use. Any `(String, String) => Double` function can be passed in; the library provides DiceSorensen (default), JaroWinkler, Levenshtein and NgramDistance. The choice of metric depends on factors such as the type of terms and performance. See below for more information about the underlying data structure.
- `similarityThreshold`: the minimum similarity score for an entry to be considered as a completion candidate
- `maxCompletionsCount`: the maximum number of completions returned by the parser
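
Since `similarityMeasure` accepts any `(String, String) => Double`, a custom metric can be written as a plain function. The sketch below is only an illustration: `SimilaritySketch`, `bigrams` and `jaccardSimilarity` are hypothetical names that are not part of the library, and the value range the library expects from a custom measure (and how it is compared against `similarityThreshold`) is not stated here, so it should be checked before use.

```scala
// Hypothetical helper, not part of the library: a Jaccard similarity
// computed over the sets of character bigrams of two strings.
object SimilaritySketch {

  // Splits a string into its set of lowercase character bigrams.
  // Strings shorter than two characters yield an empty set.
  def bigrams(s: String): Set[String] =
    s.toLowerCase.sliding(2).filter(_.length == 2).toSet

  // Jaccard index |A ∩ B| / |A ∪ B|, a value between 0.0 and 1.0.
  val jaccardSimilarity: (String, String) => Double = (a, b) => {
    val (x, y) = (bigrams(a), bigrams(b))
    if (x.isEmpty && y.isEmpty) 1.0
    else (x intersect y).size.toDouble / (x union y).size
  }
}
```

Assuming the signature shown above, such a function would be passed by name, for example `oneOfTerms(terms, similarityMeasure = SimilaritySketch.jaccardSimilarity)`.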

### Fuzzy matching technique

For fuzzy completion, terms are decomposed into their trigrams and stored in a map indexed by these trigrams. This allows fast lookup of the completion candidates that share trigrams with the input. The candidates are ranked by the number of trigrams they share with the input, and only a subset of the highest-ranked ones is kept. These are then re-evaluated with the specified similarity metric (`similarityMeasure`), which is assumed to be more precise (and thus slower).

The top candidates, up to `maxCompletionsCount`, are returned as completions.

Note that terms are affixed so that their first two and last two characters count more than the others, in order to favor completions that start or end with the same characters as the input.
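
The two-stage lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `TrigramIndexSketch` and all its members are hypothetical names, the padding with `^^` and `$$` is one possible affixing scheme, and `diceOverTrigrams` is a stand-in for whichever `similarityMeasure` is configured.

```scala
// Hypothetical sketch of a trigram index with two-stage candidate ranking.
object TrigramIndexSketch {

  // Pads the term on both sides (assumed affixing scheme) so that extra
  // trigrams are anchored to its start and end, then extracts all trigrams.
  def trigrams(term: String): Seq[String] =
    ("^^" + term.toLowerCase + "$$").sliding(3).toSeq

  // Inverted index: trigram -> set of terms containing that trigram.
  def buildIndex(terms: Seq[String]): Map[String, Set[String]] =
    terms
      .flatMap(t => trigrams(t).map(g => g -> t))
      .groupBy(_._1)
      .map { case (g, pairs) => g -> pairs.map(_._2).toSet }

  // Stand-in similarity: Dice coefficient over the two trigram sets.
  val diceOverTrigrams: (String, String) => Double = (a, b) => {
    val (x, y) = (trigrams(a).toSet, trigrams(b).toSet)
    2.0 * (x intersect y).size / (x.size + y.size)
  }

  // Stage 1: rank terms by the number of trigrams shared with the input
  //          and keep the best `shortlistSize`.
  // Stage 2: re-score the shortlist with `similarity`, keep `maxCompletions`.
  def complete(index: Map[String, Set[String]],
               input: String,
               similarity: (String, String) => Double,
               shortlistSize: Int,
               maxCompletions: Int): Seq[(String, Double)] = {
    val sharedCounts: Map[String, Int] =
      trigrams(input).distinct
        .flatMap(g => index.getOrElse(g, Set.empty[String]).toSeq)
        .groupBy(identity)
        .map { case (term, hits) => term -> hits.size }

    sharedCounts.toSeq
      .sortBy { case (term, count) => (-count, term) }
      .take(shortlistSize)
      .map { case (term, _) => term -> similarity(input, term) }
      // Highest score first; ties are broken alphabetically.
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(maxCompletions)
  }
}
```

Only terms sharing at least one trigram with the input ever reach the second stage, which is what keeps the more expensive similarity measure off the bulk of the term list.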
