Incorrect forced encoding for Regexp with a Unicode property/script #2620

@andrykonchin

For instance, the following regexps have UTF-8 encoding in CRuby:

/\p{Arabic}/.encoding # => #<Encoding:UTF-8>
/\p{L}/.encoding # => #<Encoding:UTF-8>

but Prism sets the forced_us_ascii_encoding flag:

bin/parse -e '/\p{L}/'
@ ProgramNode (location: (1,0)-(1,7))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,7))
    └── body: (length: 1)
        └── @ RegularExpressionNode (location: (1,0)-(1,7))
            ├── flags: forced_us_ascii_encoding
            ├── opening_loc: (1,0)-(1,1) = "/"
            ├── content_loc: (1,1)-(1,6) = "\\p{L}"
            ├── closing_loc: (1,6)-(1,7) = "/"
            └── unescaped: "\\p{L}"

and

bin/parse -e '/\p{Arabic}/'
@ ProgramNode (location: (1,0)-(1,12))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,12))
    └── body: (length: 1)
        └── @ RegularExpressionNode (location: (1,0)-(1,12))
            ├── flags: forced_us_ascii_encoding
            ├── opening_loc: (1,0)-(1,1) = "/"
            ├── content_loc: (1,1)-(1,11) = "\\p{Arabic}"
            ├── closing_loc: (1,11)-(1,12) = "/"
            └── unescaped: "\\p{Arabic}"

Related issue - #1997
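
The CRuby side of this comparison can be checked with a short script. This is a minimal sketch, not part of Prism: `describe_regexp_encoding` is a hypothetical helper name, and it builds each regexp with `Regexp.new` from a UTF-8 string rather than from a literal so that the result does not depend on the source file's encoding.

```ruby
# Hypothetical helper (not part of Prism or CRuby): compile a pattern from a
# UTF-8 string and report the encoding CRuby assigned to the resulting Regexp.
def describe_regexp_encoding(source)
  regexp = Regexp.new(source.encode(Encoding::UTF_8))
  { encoding: regexp.encoding, fixed: regexp.fixed_encoding? }
end

# An ASCII-only pattern with no Unicode property, for contrast.
plain = describe_regexp_encoding('abc')

# The two patterns from this report.
property = describe_regexp_encoding('\p{L}')
script   = describe_regexp_encoding('\p{Arabic}')

puts "abc        -> #{plain[:encoding]}"
puts "\\p{L}      -> #{property[:encoding]}"
puts "\\p{Arabic} -> #{script[:encoding]}"
```

On CRuby the plain pattern should report US-ASCII, while the two property patterns should report UTF-8, which is why a `forced_us_ascii_encoding` flag on those nodes does not match.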
