
Improve urls regex #70

@SuperKogito

@vsoch Once again, here is another attempt at improving our long and forgiving regex 😄

A little background: the current regex is something I found online, and after testing it against a set of links I deemed it good enough. However, I was never comfortable with how long it is.

Complexity, simplicity and regex visualizations

Here is a simplified graph of what we have at the moment (domain extensions are replaced with ... except .com and .org):
[image: simplified graph of the current regex]

So after hacking and tweaking for a couple of days, I think I came up with an improved regex: it is shorter and simpler, and the benchmark below suggests it is also faster. Here is how it looks:
[image: simplified graph of the new regex]

Comparing efficiency and speed

Here is a small idea of how it performs: https://regex101.com/r/zvnFp6/1
Unfortunately I couldn't run the same thing for our current regex because it is too long. However, I did run the following comparison locally:

import re 
import time 


domain_extensions = "".join(
    (
        "com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|",
        "jobs|mobi|museum|name|post|pro|tel|travel|xxx|",
        "ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|",
        "ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|",
        "ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|",
        "dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|",
        "fi|fj|fk|fm|fo|fr|",
        "ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|",
        "hk|hm|hn|hr|ht|hu|",
        "id|ie|il|im|in|io|iq|ir|is|it|",
        "je|jm|jo|jp|ke|kg|kh|ki|",
        "km|kn|kp|kr|kw|ky|kz|",
        "la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|",
        "ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|",
        "na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|",
        "om|",
        "pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|",
        "qa|",
        "re|ro|rs|ru|rw|",
        "sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|",
        "tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|",
        "ua|ug|uk|us|uy|uz|",
        "va|vc|ve|vg|vi|vn|vu|",
        "wf|ws|",
        "ye|yt|yu|",
        "za|zm|zw",
    )
)
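# NOTE: this alternation mirrors the TLD list embedded verbatim in
# CURRENT_REGEX below, including its stray "Ja" entry.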
    
URL_REGEX1 = "".join(
    (
        # NOTE: this hand-formatted transcription is not byte-identical to
        # CURRENT_REGEX below: it reads "htt/{1,3}" where CURRENT_REGEX has
        # "/{1,3}", and "{}" is missing from two of the character classes,
        # which likely explains why the two report different match counts.
        "(?i)\\b(",
        "(?:",
        "https?:(?:htt/{1,3}|[a-z0-9%]",
        ")",
        "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
        "(?:",
        "[^\\s()<>\\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        ")",
        "+",
        "(?:",
        "\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        "|[^\\s`!()\\[\\];:'\".,<>?«»“”‘’]",
        ")",
        "|",
        "(?:",
        "(?<!@)[a-z0-9]",
        "+(?:[.\\-][a-z0-9]+)*[.]",
        "(?:%s)\\b/?(?!@)" % domain_extensions,
        "))",
    )
)
CURRENT_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
NEW_REGEX = r"(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"
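# NEW_REGEX captures five groups (scheme, optional "www.", host, TLD,
# path/query), so re.findall returns 5-tuples; they are re-joined into
# full URLs below.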

# read file content
file_path = "links.txt"
with open(file_path, "r") as file:
    content = file.read()

links = [l for l in content.split("\n") if "http" in l]  # lines that contain a URL


# 1st regex: the formatted transcription (URL_REGEX1)
t01 = time.time()
for i in range(1000):
    urls0 = re.findall(URL_REGEX1, content)
t02 = time.time()
print("DT0  =", t02-t01)
print("LEN0 = ", len(urls0))


# current regex as a single raw string (CURRENT_REGEX)
t11 = time.time()
for i in range(1000):
    urls1 = re.findall(CURRENT_REGEX, content)
t12 = time.time()
print("DT1  =", t12-t11)
print("LEN1 = ", len(urls1))


# 2nd regex: the proposed new pattern (NEW_REGEX)
t21   = time.time()
for i in range(1000):
    urls2 = ["".join(x) for x in re.findall(NEW_REGEX, content)]
t22   = time.time()
print("DT2  =", t22-t21)
print("LEN2 = ", len(urls2))
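One caveat about these timing loops: re.findall with a pattern string compiles the pattern on first use and then serves it from Python's internal cache, so compilation cost is mostly amortized already; compiling explicitly just keeps it out of the timed region altogether. A minimal sketch of the same loop with an explicit re.compile (the new_pattern/urls3/t31/t32 names are mine, not part of the script above):

# compile once, outside the timed loop, so only matching is measured
new_pattern = re.compile(NEW_REGEX)

t31 = time.time()
for i in range(1000):
    urls3 = ["".join(x) for x in new_pattern.findall(content)]
t32 = time.time()
print("DT3  =", t32 - t31)
print("LEN3 = ", len(urls3))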

links.txt is a file with 755 URLs, each on a separate line. These URLs are collected from the logs of buildtest and us-rse. Running the comparison gives the following results:

DT0  = 2.3765275478363037
LEN0 =  748

DT1  = 0.7541322708129883
LEN1 =  755

DT2  = 0.6342747211456299
LEN2 =  755

As you can see, the long, beautifully formatted regex takes the most time and matches fewer URLs than the others. The new regex is the fastest, and every URL it returns is guaranteed to start with http:// or https://, since the scheme group is mandatory.
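Since the scheme group in NEW_REGEX is mandatory and comes first, that guarantee is easy to sanity-check by appending one line to the script above:

# every re-joined match must start with the mandatory scheme group
assert all(u.startswith(("http://", "https://")) for u in urls2)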

So what's next?

I suggest you take a look at all this, and maybe also test the regex with different URLs and ideas to check its robustness; if your results are positive too, then I can submit a PR 😉 The blog post In search of the perfect URL validation regex is a good source of inspiration. I think we would rank roughly third according to its test.
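To get a head start on that robustness check, here is a minimal sketch that runs NEW_REGEX over a handful of tricky candidates adapted from that post's test suite and prints what the pattern extracts. The verdicts in the comments are the post's expectations, not the pattern's actual behavior, and the candidate list is only illustrative:

import re

NEW_REGEX = r"(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"

# tricky candidates adapted from the blog post's test suite
candidates = [
    "http://foo.com/blah_blah",                 # valid per the post
    "http://www.example.com/wpstyle/?p=364",    # valid per the post
    "http://userid:password@example.com:8080",  # valid per the post
    "http://142.42.1.1/",                       # valid per the post
    "http://",                                  # invalid per the post
    "http://.",                                 # invalid per the post
    "http://-error-.invalid/",                  # invalid per the post
]

for url in candidates:
    match = re.search(NEW_REGEX, url)
    print(url, "->", match.group(0) if match else "no match")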
