Parser enhancements #9

@Gentle

Hello, I am very interested in this project. It feels easy and fast to use, and I would love to replace hunspell in my codebase, since hunspell is a pain to compile for Mac or wasm environments :)

I have been exploring the parser by installing all hunspell-* packages on Debian and then trying to load everything in /usr/share/hunspell.

Some observations:

  • generally, some files start with a BOM (byte order mark), which can be stripped easily (dfa77ad)
  • en_GB
    • the infamous lines zverok used to explain the issue with comments fail to parse:
      COMPOUNDRULE 2
      COMPOUNDRULE #*0{
      COMPOUNDRULE #*@}
      
      While there is a test showing that #*0{ can be parsed as flags, the parser errors on these lines because everything after the first " #" is treated as a comment and filtered out early.
      To make this work, I followed zverok's suggestion and refactored the comment handling: instead of stripping comments early, lines are allowed to end with arbitrary unparsed data, which turns comments into a side effect. This seems to be how upstream handles it (84b41ea).
  • da_DK
    • wraps words in "" when the word itself should contain a /
      all occurrences: "A/S", "c/o", "I/S". I don't think this warrants adding support for a non-feature that one dictionary invented
    • contains one interestingly malformed line
      "FedEx/9 ph:https://denstoredanske.lex.dk/FedEx"Fedkrog/54,9
      I think this was supposed to be
      FedEx/9 ph:https://denstoredanske.lex.dk/FedEx
      Fedkrog/54,9
      
  • tr_TR
    • uses FLAG num with flags numbered from 0, which breaks the NonZero type.
      This can be handled by replacing 0 with u16::MAX or any other unused number
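
The en_GB comment issue above can be sketched roughly as follows. This is a minimal illustration only, not the code from 84b41ea: `strip_comment_early` and `parse_compound_rule` are hypothetical names, and the real parser handles far more than this one directive. The first function mimics stripping everything after the first " #" before parsing; the second reads only the fields the directive needs and ignores whatever trails them.

```rust
/// Early comment stripping (the problematic approach): cut the line at the
/// first " #" before any parsing happens.
fn strip_comment_early(line: &str) -> &str {
    match line.find(" #") {
        Some(index) => &line[..index],
        None => line,
    }
}

/// Comments as a side effect: read the keyword and exactly one value, then
/// ignore any remaining fields on the line, whether or not they start with '#'.
/// Returns `None` if the line is not a `COMPOUNDRULE` line or has no value.
fn parse_compound_rule(line: &str) -> Option<&str> {
    let mut fields = line.split_whitespace();
    if fields.next()? != "COMPOUNDRULE" {
        return None;
    }
    fields.next()
}
```

With early stripping, `COMPOUNDRULE #*0{` is reduced to `COMPOUNDRULE` and the rule is lost. With the second approach, the same line yields `#*0{`, while `COMPOUNDRULE 2 # note` still yields `2` because the trailing fields are never looked at.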

For da_DK and tr_TR I just patched the dictionaries. The only dictionary that is still problematic is hu_HU, but those files are actually binary, and I think it is better to convert them manually to SET UTF-8 / FLAG num format so that modern parsers don't need to support legacy binary files.
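
As an alternative to patching tr_TR by hand, the 0 to u16::MAX conversion suggested above might look like the sketch below. It assumes the flag type is a `NonZeroU16` (the `Flag` alias here is a stand-in, not necessarily the library's actual type) and that the dictionary never uses 65535 as a real flag. `parse_numeric_flag` and `parse_numeric_flags` are hypothetical names.

```rust
use std::num::NonZeroU16;

/// Stand-in for the parser's flag type (an assumption for this sketch).
type Flag = NonZeroU16;

/// Parses one `FLAG num` value, remapping 0 to u16::MAX so that it fits
/// into a non-zero type. Returns `None` if the text is not a valid u16.
fn parse_numeric_flag(text: &str) -> Option<Flag> {
    let number: u16 = text.trim().parse().ok()?;
    // Assumption: u16::MAX is not otherwise used as a flag in the dictionary.
    Flag::new(if number == 0 { u16::MAX } else { number })
}

/// Parses a comma-separated flag set such as `54,9`.
/// Fails as a whole if any single flag fails to parse.
fn parse_numeric_flags(text: &str) -> Option<Vec<Flag>> {
    text.split(',').map(parse_numeric_flag).collect()
}
```

The trade-off is that the remapping has to be applied consistently everywhere flags are read, in both the .aff and .dic files, otherwise a flag 0 in one place would not match its remapped value in another.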

Do you want those two patches (dfa77ad, 84b41ea) as one or two PRs? And thank you for the library!
