Commit 1b12faa
Make the fallthrough character tokenization also capture unpaired surrogates. Putting the unpaired-surrogate pattern in an | alternation should ensure that full codepoints are preferred and half codepoints are used only as a last resort.
Addresses #1298
Add a debug line for the fallthrough rule
Add a couple tests of the half codepoint fix
1 parent 63fda49 commit 1b12faa
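The lexer rules themselves live under src/edu/stanford/nlp/process and are not shown on this page, so the following is only a plain-Java sketch of the preference the commit message describes: take a full codepoint (a valid surrogate pair) when one is available, and emit a half codepoint only when the surrogate is unpaired. The class and method names (`FallthroughSketch`, `fallthroughTokens`) are hypothetical and do not come from the repository.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the project's lexer code.
class FallthroughSketch {

  // Splits text into one-"character" tokens, preferring full codepoints.
  static List<String> fallthroughTokens(String text) {
    List<String> tokens = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      char c = text.charAt(i);
      boolean validPair = Character.isHighSurrogate(c)
          && i + 1 < text.length()
          && Character.isLowSurrogate(text.charAt(i + 1));
      if (validPair) {
        // Preferred branch: a high surrogate followed by a low surrogate
        // is kept together as one full codepoint.
        tokens.add(text.substring(i, i + 2));
        i += 2;
      } else {
        // Fallback branch: an ordinary char, or an unpaired surrogate
        // (a "half codepoint") emitted on its own instead of being lost.
        tokens.add(text.substring(i, i + 1));
        i += 1;
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // 'a', one emoji (a valid surrogate pair), a lone high surrogate, 'b'
    String text = "a\uD83D\uDE00\uD83Db";
    System.out.println(fallthroughTokens(text).size()); // prints 4
  }
}
```

In this sketch the emoji stays in one token while the lone high surrogate becomes its own one-char token, which mirrors the "only in an emergency" behavior the message aims for.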
File tree
3 files changed: +57005 -57323 lines changed
- src/edu/stanford/nlp/process
- test/src/edu/stanford/nlp/process