Han unification regex is incorrect #107

@KarolS

The regex uses 5-digit Unicode escapes, but a \u escape takes exactly 4 hex digits, so the fifth digit is read as a separate literal character and the pattern does not match what was intended.

For example, this fragment: \u20000-\u2A6DF is interpreted as three separate items (two single characters and one range):

  • U+2000 (which is not a Han character)

  • the range from 0 (U+0030) to U+2A6D (which covers a huge number of unrelated characters, including the entire Latin alphabet, but no Han characters)

  • F (U+0046)
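
A quick way to confirm the breakdown above. This is only a sketch: it assumes the pattern is evaluated by a JavaScript engine without the `u` flag, and it tests just the fragment quoted here wrapped in a character class, not the library's full regex. The name `fragment` and the sample characters are mine.

```javascript
// Assumption: JavaScript RegExp without the `u` flag; only the quoted fragment is tested.
const fragment = /^[\u20000-\u2A6DF]$/;

const checks = {
  // U+2000 (EN QUAD) matches as a single literal character.
  enQuad: fragment.test("\u2000"),
  // "z" (U+007A) falls inside the accidental range U+0030..U+2A6D.
  latinLower: fragment.test("z"),
  // "F" (U+0046) matches as the trailing literal (and is inside the range too).
  letterF: fragment.test("F"),
  // U+4E2D is a common Han character, above U+2A6D, so it does not match.
  hanBmp: fragment.test("\u4E2D"),
  // U+20000 is the first character of the intended block; it is two UTF-16
  // code units, so a single-unit character class cannot match it.
  hanExtB: fragment.test("\u{20000}"),
};

console.log(checks);
```

So the fragment accepts Latin letters and rejects the Han characters it was meant to cover.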

I guess the regex should be rewritten using surrogates, like the emoji one.
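
A rough sketch of what the surrogate form could look like for this one range, U+20000 to U+2A6DF. It assumes a JavaScript engine; the names `extBSurrogates` and `extBCodePoints` are made up for illustration, and I have not checked how the emoji regex in the repo is actually structured. The surrogate bounds follow from the UTF-16 encoding: U+20000 is \uD840\uDC00 and U+2A6DF is \uD869\uDEDF.

```javascript
// Surrogate-pair form: works without the `u` flag.
// High surrogates D840..D868 may be followed by any low surrogate;
// the last high surrogate D869 is limited to low surrogates DC00..DEDF.
const extBSurrogates = /^(?:[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF])$/;

// Alternative, if the `u` flag is available: code point escapes with braces.
const extBCodePoints = /^[\u{20000}-\u{2A6DF}]$/u;

const first = String.fromCodePoint(0x20000); // first character in the range
const last = String.fromCodePoint(0x2A6DF);  // last character in the range
const after = String.fromCodePoint(0x2A6E0); // one past the end of the range

for (const re of [extBSurrogates, extBCodePoints]) {
  console.log(re.test(first), re.test(last), re.test(after), re.test("a"));
}
```

Both patterns should accept the two endpoints and reject U+2A6E0 and Latin letters. The other 5-digit ranges in the regex would need the same treatment.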
