Figure out how to treat diacritics better #201

@dimus

@abubelinha raised the following in #199:

In summary, for the ö case, I think o is a much more conservative replacement than oe (which looks like a Germanic phonetic replacement; gnparser does not do that in other cases such as ñ, which is replaced by n even though it sounds more like ny in Spanish).

New comment now:

Since opinions may differ on this, I wonder whether a future version could accept a user-supplied array of replacements (i.e. a config file, or something we can post through the API), so that we can force gnparser to turn ó/ò/ô/ø/ö into o (instead of oe), п/ñ into n, г into r, and so on (a user choice that overrides the defaults).
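To make the request concrete, here is a minimal sketch of a user-overridable replacement table. It is written in Python purely for illustration and is not gnparser's API: `DEFAULT_REPLACEMENTS`, `USER_OVERRIDES` and `normalize` are hypothetical names, and the two defaults only mirror the behaviour described above (ö becomes oe, ñ becomes n). Only lowercase characters are handled.

```python
# Hypothetical sketch; none of these names exist in gnparser.

# Defaults as described in this issue: "ö" -> "oe", "ñ" -> "n".
DEFAULT_REPLACEMENTS = {
    "ö": "oe",
    "ñ": "n",
}

# Example of what a user might supply via a config file or an API call.
# Note that "п" and "г" are Cyrillic letters that look like Latin "n" and "r".
USER_OVERRIDES = {
    "ó": "o", "ò": "o", "ô": "o", "ø": "o", "ö": "o",
    "п": "n", "ñ": "n",
    "г": "r",
}


def normalize(name, overrides=None):
    """Replace characters one by one; user overrides win over defaults."""
    table = dict(DEFAULT_REPLACEMENTS)
    if overrides:
        table.update(overrides)
    # Characters without an entry in the table are kept unchanged.
    return "".join(table.get(ch, ch) for ch in name)
```

With no overrides, `normalize("ö")` gives `"oe"`; passing `USER_OVERRIDES` gives `"o"` instead, and a stray Cyrillic `г` is mapped to `r`.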

Perhaps the Cyrillic-character issue (originating from keyboards, OCR, or spell checkers?) is frequent in some scenarios, and it would be good to let gnparser correct it when we know it is happening.
Spell checkers have the side effect of capitalizing the first letter of some words (after "subsp." or "var."); depending on the spell checker's language, they could also be the origin of some of the accented characters in Latin names.
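As a rough illustration of the two symptoms mentioned here, the sketch below flags Cyrillic characters in a name and lowercases the first letter that follows "subsp." or "var.". It is a Python sketch, not something gnparser does; `cyrillic_chars` and `lowercase_after_rank` are hypothetical helpers. It assumes that a capital letter right after one of these two rank markers is always a spell-checker artifact, which may not hold for every input.

```python
import re
import unicodedata

# Rank marker plus following whitespace (group 1), then the first
# character of the next word (group 2). Only the two markers named in
# this issue are covered.
RANK_RE = re.compile(r"(\b(?:subsp|var)\.\s+)(\w)")


def cyrillic_chars(name):
    """Return the Cyrillic characters found in a name, in order."""
    return [
        ch for ch in name
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    ]


def lowercase_after_rank(name):
    """Lowercase the first letter of the word after 'subsp.' or 'var.'."""
    return RANK_RE.sub(lambda m: m.group(1) + m.group(2).lower(), name)
```

A caller could run `cyrillic_chars` first and apply a replacement table only to names where it returns something, so that ordinary names are left untouched.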
