
Improve urls regex #70

@SuperKogito

@vsoch Once again, here is another attempt at improving our long and forgiving regex 😄

A little background: the current regex is something I found online, and after testing it against a set of links I deemed it good enough. However, I was never comfortable with how long it is.

Complexity, simplicity and regex visualizations

Here is a simplified graph of what we have at the moment (domain extensions are replaced with ... except .com and .org):
[image: simplified graph of the current regex]

So after hacking and tweaking for a couple of days, I think I came up with an improved regex: it is shorter and simpler, and the benchmark below suggests it is also faster. Here is how it looks:
[image: simplified graph of the new regex]

Comparing efficiency and speed

Here is a small idea of how it performs: https://regex101.com/r/zvnFp6/1
Unfortunately I couldn't run the same thing for our current regex because it is too long. However, I did run the following comparison locally:

import re 
import time 


domain_extensions = "".join(
    (
        "com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|",
        "jobs|mobi|museum|name|post|pro|tel|travel|xxx|",
        "ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|",
        "ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|",
        "ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|",
        "dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|",
        "fi|fj|fk|fm|fo|fr|",
        "ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|",
        "hk|hm|hn|hr|ht|hu|",
        "id|ie|il|im|in|io|iq|ir|is|it|",
        "je|jm|jo|jp|ke|kg|kh|ki|",
        "km|kn|kp|kr|kw|ky|kz|",
        "la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|",
        "ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|",
        "na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|",
        "om|",
        "pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|",
        "qa|",
        "re|ro|rs|ru|rw|",
        "sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|",
        "tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|",
        "ua|ug|uk|us|uy|uz|",
        "va|vc|ve|vg|vi|vn|vu|",
        "wf|ws|",
        "ye|yt|yu|",
        "za|zm|zw",
    )
)
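# NOTE: this alternation mirrors the TLD list embedded verbatim in
# CURRENT_REGEX below, including its stray "Ja" entry.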
    
URL_REGEX1 = "".join(
    (
        # NOTE: this hand-formatted transcription is not byte-identical to
        # CURRENT_REGEX below: it reads "htt/{1,3}" where CURRENT_REGEX has
        # "/{1,3}", and "{}" is missing from two of the character classes,
        # which likely explains why the two report different match counts.
        "(?i)\\b(",
        "(?:",
        "https?:(?:htt/{1,3}|[a-z0-9%]",
        ")",
        "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
        "(?:",
        "[^\\s()<>\\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        ")",
        "+",
        "(?:",
        "\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        "|[^\\s`!()\\[\\];:'\".,<>?«»“”‘’]",
        ")",
        "|",
        "(?:",
        "(?<!@)[a-z0-9]",
        "+(?:[.\\-][a-z0-9]+)*[.]",
        "(?:%s)\\b/?(?!@)" % domain_extensions,
        "))",
    )
)
CURRENT_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
NEW_REGEX = r"(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"
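# NEW_REGEX captures five groups (scheme, optional "www.", host, TLD,
# path/query), so re.findall returns 5-tuples; they are re-joined into
# full URLs below.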

# read file content
file_path = "links.txt"
with open(file_path, "r") as file:
    content = file.read()

links = [l for l in content.split("\n") if "http" in l]  # lines that contain a URL


# 1st regex: the formatted transcription (URL_REGEX1)
t01 = time.time()
for i in range(1000):
    urls0 = re.findall(URL_REGEX1, content)
t02 = time.time()
print("DT0  =", t02-t01)
print("LEN0 = ", len(urls0))


# current regex as a single raw string (CURRENT_REGEX)
t11 = time.time()
for i in range(1000):
    urls1 = re.findall(CURRENT_REGEX, content)
t12 = time.time()
print("DT1  =", t12-t11)
print("LEN1 = ", len(urls1))


# 2nd regex: the proposed new pattern (NEW_REGEX)
t21   = time.time()
for i in range(1000):
    urls2 = ["".join(x) for x in re.findall(NEW_REGEX, content)]
t22   = time.time()
print("DT2  =", t22-t21)
print("LEN2 = ", len(urls2))
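One caveat about these timing loops: re.findall with a pattern string compiles the pattern on first use and then serves it from Python's internal cache, so compilation cost is mostly amortized already; compiling explicitly just keeps it out of the timed region altogether. A minimal sketch of the same loop with an explicit re.compile (the new_pattern/urls3/t31/t32 names are mine, not part of the script above):

# compile once, outside the timed loop, so only matching is measured
new_pattern = re.compile(NEW_REGEX)

t31 = time.time()
for i in range(1000):
    urls3 = ["".join(x) for x in new_pattern.findall(content)]
t32 = time.time()
print("DT3  =", t32 - t31)
print("LEN3 = ", len(urls3))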

links.txt is a file with 755 URLs, each on a separate line. These URLs are collected from the logs of buildtest and us-rse. Running the comparison gives the following results:

DT0  = 2.3765275478363037
LEN0 =  748

DT1  = 0.7541322708129883
LEN1 =  755

DT2  = 0.6342747211456299
LEN2 =  755

As you can see, the long, beautifully formatted regex takes the most time and matches fewer URLs than the others. The new regex is the fastest, and every URL it returns is guaranteed to start with http:// or https://, since the scheme group is mandatory.
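Since the scheme group in NEW_REGEX is mandatory and comes first, that guarantee is easy to sanity-check by appending one line to the script above:

# every re-joined match must start with the mandatory scheme group
assert all(u.startswith(("http://", "https://")) for u in urls2)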

So what's next?

I suggest you take a look at all this, and maybe also test the regex with different URLs and ideas to check its robustness; if your results are positive too, then I can submit a PR 😉 The blog post In search of the perfect URL validation regex is a good source of inspiration. I think we would rank roughly third according to its test.
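To get a head start on that robustness check, here is a minimal sketch that runs NEW_REGEX over a handful of tricky candidates adapted from that post's test suite and prints what the pattern extracts. The verdicts in the comments are the post's expectations, not the pattern's actual behavior, and the candidate list is only illustrative:

import re

NEW_REGEX = r"(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"

# tricky candidates adapted from the blog post's test suite
candidates = [
    "http://foo.com/blah_blah",                 # valid per the post
    "http://www.example.com/wpstyle/?p=364",    # valid per the post
    "http://userid:password@example.com:8080",  # valid per the post
    "http://142.42.1.1/",                       # valid per the post
    "http://",                                  # invalid per the post
    "http://.",                                 # invalid per the post
    "http://-error-.invalid/",                  # invalid per the post
]

for url in candidates:
    match = re.search(NEW_REGEX, url)
    print(url, "->", match.group(0) if match else "no match")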
