Open
Description
Hello, I am very interested in this project. It feels easy and fast to use, and I would love to replace hunspell in my codebase, since it is a pain to compile for mac or wasm environments :)
I have been exploring the parser by installing all hunspell-* packages on Debian and then trying to load everything in /usr/share/hunspell.
Some observations:
- generally, some files start with a BOM, which can be stripped easily (dfa77ad)
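For illustration, stripping the BOM from already-decoded text could look roughly like this. This is a minimal sketch, not necessarily what dfa77ad does, and `strip_bom` is a hypothetical name.

```rust
/// Returns `input` without a leading byte order mark (U+FEFF), if one is present.
/// Hypothetical helper; assumes the file has already been decoded to UTF-8.
fn strip_bom(input: &str) -> &str {
    // `strip_prefix` returns None when the prefix is absent,
    // in which case the input is passed through unchanged.
    input.strip_prefix('\u{FEFF}').unwrap_or(input)
}
```

Only a BOM at the very start is removed; a U+FEFF later in the file is left alone.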
en_GB
- the infamous lines zverok used to explain the issue about comments fail to parse:
    COMPOUNDRULE 2
    COMPOUNDRULE #*0{
    COMPOUNDRULE #*@}
  While there is a test showing that `#*0{` can be parsed as flags, the parser errors because everything after the first " #" is considered a comment and filtered out early.
  To make this work, I followed zverok's suggestion and refactored the comment handling: instead of stripping comments early, lines are allowed to end with arbitrary unparsed data, which turns comments into a side effect. This seems to be how upstream handles it (84b41ea).
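To make the difference concrete, here is a rough sketch of the two strategies. It is not the code from 84b41ea; both function names are hypothetical, and it assumes a directive that takes exactly one value.

```rust
/// Early stripping: cut the line at the first " #" and treat the rest as a comment.
/// This loses the flags in `COMPOUNDRULE #*0{`.
fn strip_comment_early(line: &str) -> &str {
    match line.find(" #") {
        Some(idx) => &line[..idx],
        None => line,
    }
}

/// Field-based parsing: read only the fields the directive expects.
/// Anything after them (for example a trailing comment) is never looked at.
fn parse_single_value<'a>(line: &'a str, directive: &str) -> Option<&'a str> {
    let mut fields = line.split_whitespace();
    if fields.next()? != directive {
        return None;
    }
    // The first field after the directive name is the value, even if it starts with '#'.
    fields.next()
}
```

With the second approach, `COMPOUNDRULE #*0{` yields the value `#*0{`, while `COMPOUNDRULE 2 # count` still yields `2` because the trailing text is simply ignored.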
da_DK
- uses "" around words that should contain a /
  All occurrences: "A/S", "c/o", "I/S". I think this does not warrant adding support for a non-feature that one dictionary invented.
- contains one interestingly malformed line:
    "FedEx/9 ph:https://denstoredanske.lex.dk/FedEx"Fedkrog/54,9
  I think this was supposed to be:
    FedEx/9 ph:https://denstoredanske.lex.dk/FedEx
    Fedkrog/54,9
tr_TR
- uses FLAG num with 0 as a flag value, which breaks the NonZero type.
  The dictionary can be converted by replacing 0 with u16::MAX or any other unused number.
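The same substitution could in principle be done while parsing instead of by patching the dictionary. A minimal sketch, assuming flags are stored as `NonZeroU16` and that the dictionary never uses 65535 itself (`parse_numeric_flag` and `ZERO_FLAG_SUBSTITUTE` are hypothetical names):

```rust
use std::num::NonZeroU16;

/// Hypothetical stand-in for flag 0. Assumes 65535 is not used by the dictionary;
/// otherwise the two flags would collide.
const ZERO_FLAG_SUBSTITUTE: u16 = u16::MAX;

/// Parses one `FLAG num` token, remapping 0 so it fits into a NonZero type.
fn parse_numeric_flag(token: &str) -> Option<NonZeroU16> {
    let value: u16 = token.trim().parse().ok()?;
    let remapped = if value == 0 { ZERO_FLAG_SUBSTITUTE } else { value };
    NonZeroU16::new(remapped)
}
```

Whether silently remapping is preferable to rejecting the file is a design question; patching the dictionary keeps the parser strict.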
For da_DK and tr_TR I just patched the dictionaries. The only dictionary that is still problematic is hu_HU, but those files are actually binary, and I think it is better to convert them manually to SET UTF8 FLAG num format so that modern parsers don't need to support legacy binary files.
Do you want those two patches (dfa77ad, 84b41ea) as one or two PRs? And thank you for the library.