URL parsing fails for valid links containing non-ASCII characters in domain #56

@HPRILLER

I’ve encountered an issue where some valid URLs containing non-ASCII characters in their domain names are not parsed correctly.

"http://faß.de"

Expected: http://faß.de
Actual: http://fa/

"http://نامه‌ای.com"

Expected: http://نامه‌ای.com
Actual: http://نامه and ای.com

"http://ශ්‍රී.com"

Expected: http://ශ්‍රී.com
Actual: http://ශ් and රී.com

It seems the current parsing logic might not fully support IDN (Internationalized Domain Names) or certain Unicode characters in URLs.

Environment:

PHP version: 7.3

Library version: 3.12 (latest)

Possible cause:

The regex or parsing method used to detect URLs might not be Unicode-aware for the domain name part, so the match ends or splits at the first character it does not recognize.

Suggestion:

Consider updating the regex to support Unicode letters and IDN domains, for example by matching letters with \p{L} and making sure the u (UTF-8) pattern modifier is set.
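
To make the suggestion concrete, here is a minimal sketch comparing an ASCII-only host pattern with a Unicode-aware one. Both patterns and the extractUrl() helper are hypothetical and written only for illustration; they are not the library's actual regex or API. The second and third samples appear to split where the domain contains a zero-width non-joiner (U+200C) and a zero-width joiner (U+200D), so the sketch assumes those two code points, plus combining marks (\p{M}) and digits (\p{N}), need to be allowed alongside \p{L}.

```php
<?php
// Hypothetical patterns for illustration only; not the library's real regex.
// ASCII-only host, no "u" modifier: the match stops at the first non-ASCII byte.
const ASCII_ONLY = '~https?://[a-z0-9.-]+~i';
// Unicode-aware host: letters, combining marks, digits, ZWNJ/ZWJ, dot, hyphen.
const UNICODE_AWARE = '~https?://[\p{L}\p{M}\p{N}\x{200C}\x{200D}.-]+~iu';

// Hypothetical helper: returns the first URL-like match, or null if none.
function extractUrl(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$samples = [
    'http://faß.de',
    "http://نامه\u{200C}ای.com",   // contains a zero-width non-joiner
    "http://ශ්\u{200D}රී.com",      // contains a zero-width joiner
];

foreach ($samples as $url) {
    printf(
        "%s => ascii: %s | unicode: %s\n",
        $url,
        var_export(extractUrl(ASCII_ONLY, $url), true),
        var_export(extractUrl(UNICODE_AWARE, $url), true)
    );
}
```

With the ASCII-only pattern the first sample is cut to http://fa, which is consistent with the truncation reported above. The Unicode-aware pattern matches all three samples in full. A production pattern would likely also need to handle ports, paths, and validation of the resulting host, which this sketch leaves out.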
