Incorrect/incomplete extraction of URLs with special characters #4475

@stefan6419846

Description

I just stumbled upon some files which have links in the following format:

https://sk.wikipedia.org/wiki/Administratívne_členenie_Slovenska
https://ru.wikipedia.org/wiki/Федеральные_округа_Российской_Федерации

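Running this file through the URL detection yields https://sk.wikipedia.org/wiki/Administrativne_clenenie_Slovenska and https://ru.wikipedia.org/wiki, neither of which matches the input: the first has its diacritics replaced by plain ASCII letters, and the second is cut off before the Cyrillic path segment.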

How To Reproduce

  • Save the two URLs above as test.txt (UTF-8 encoded, one URL per line).

  • Open the interactive Python console and run:

    >>> from scancode import api
    >>> api.get_urls('test.txt')
    {'urls': [{'url': 'https://sk.wikipedia.org/wiki/Administrativne_clenenie_Slovenska', 'start_line': 1, 'end_line': 1}, {'url': 'https://ru.wikipedia.org/wiki', 'start_line': 2, 'end_line': 2}]}
    >>> 
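For comparison with the output above, here is a minimal sketch of the result one would expect, namely both URLs returned unchanged. It does not use scancode at all: URL_PATTERN and find_urls_baseline are hypothetical names introduced only for this illustration, and the pattern is a deliberately naive baseline (scheme followed by any non-whitespace run), not the pattern scancode uses.

```python
import re

# Hypothetical baseline pattern (an assumption for illustration, not scancode's):
# a scheme followed by any run of non-whitespace characters. On str patterns in
# Python 3, \S also matches non-ASCII letters, so nothing is transliterated or cut.
URL_PATTERN = re.compile(r"https?://\S+")

# The two links from the report, one per line, as they would appear in test.txt.
SAMPLE = (
    "https://sk.wikipedia.org/wiki/Administratívne_členenie_Slovenska\n"
    "https://ru.wikipedia.org/wiki/Федеральные_округа_Российской_Федерации\n"
)


def find_urls_baseline(text):
    """Return (line_number, url) pairs, leaving non-ASCII characters untouched."""
    results = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in URL_PATTERN.finditer(line):
            results.append((line_number, match.group(0)))
    return results


expected = SAMPLE.splitlines()
found = [url for _, url in find_urls_baseline(SAMPLE)]
# Both URLs should come back exactly as written in the input.
assert found == expected
```

Such a naive pattern would over-match in real files (trailing punctuation, closing brackets), so it only serves as a reference for the expected values in a regression test, not as a proposed fix.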
    

System configuration

For bug reports, it really helps us to know:

  • What OS are you running on? (Windows/MacOS/Linux) - Linux
  • What version of scancode-toolkit was used to generate the scan file? - 32.3.3
  • What installation method was used to install/run scancode? (pip/source download/other) - pip
