First project I made for fun lol :D
A super-efficient, async URL scanner that checks thousands of URLs from a file (.csv/.txt) for dead links (404s). Use it to clean broken links out of a large URL list.
Features:
- Async scanning with a concurrency limit (default 15) for fast performance (see the sketch after this list)
- Domain filtering — scan all URLs in the file, or target specific domains only
- File cleaning — automatically remove dead URLs from your file (optional) [BETA]
- Backup system — backs up your original CSV before cleaning
- Scan report — lists all dead URLs found
- URL format support — handles `http://`, `https://`, and `www.`
- Stealth headers — mimics real browser requests, bypassing most web-security bots
- Progress bar — live scan progress displayed with tqdm
- Platforms — works on Windows and Linux (not tested on macOS)
- Optimized for low-end hardware — scans 1,000 links in about 5 minutes (tested on Linux | 3rd-gen i5 | 4 GB RAM)
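To give a sense of how this works, here is a minimal, hypothetical sketch of async scanning with an `asyncio.Semaphore` concurrency cap, Playwright page requests, and a tqdm progress bar. It is not the actual code in `DeadURL.py`; names and timeouts are illustrative.

```python
# Minimal sketch (not the project's actual implementation): check URLs
# concurrently with an asyncio.Semaphore cap and a tqdm progress bar.
import asyncio
from playwright.async_api import async_playwright
from tqdm.asyncio import tqdm_asyncio

CONCURRENT_LIMIT = 15  # default concurrency mentioned above


async def check_url(context, sem, url):
    """Open the URL in a fresh page and return (url, HTTP status or None)."""
    async with sem:
        page = await context.new_page()
        try:
            response = await page.goto(url, timeout=15_000)
            return url, response.status if response else None
        except Exception:
            return url, None  # unreachable or timed out
        finally:
            await page.close()


async def scan(urls):
    sem = asyncio.Semaphore(CONCURRENT_LIMIT)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        results = await tqdm_asyncio.gather(*(check_url(context, sem, u) for u in urls))
        await browser.close()
    # Treat 404s (and unreachable URLs) as dead
    return [url for url, status in results if status in (None, 404)]


# dead = asyncio.run(scan(["https://example.com", "https://example.com/missing"]))
```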
SKIP THIS STEP IF you only want a scan and a scan report generated, and do not want dead URLs cleaned from your file.
If you are running this on Windows, disable the "Controlled Folder Access" feature in Windows Defender before running the cleaning step.
Steps:
- Open Windows Security
- Go to Virus & threat protection
- Click Manage ransomware protection
- Turn off Controlled Folder Access (you may re-enable this after the scan is complete)
To run the scanner:
1. Clone or download this repo
2. Run `pip install -r requirements.txt` (or `py -m pip install -r requirements.txt`, or `python -m pip install -r requirements.txt`) to install the dependencies
3. ⚠️ Run `playwright install` or `py -m playwright install` ⚠️
4. Start the scanner: `py DeadURL.py` or `python DeadURL.py`
5. When prompted:
   - Drop or enter the full path to your CSV file
   - Optionally scan a specific domain (or leave blank to scan all)
   - Choose whether to remove dead URLs from the CSV after the scan (y/n)
6. Wait for the scan to complete
7. Check the generated `scan_results_YYYY-MM-DD_HH-MM-SS.txt` file for 404 errors
8. If cleaning was enabled, your original CSV will be backed up and cleaned of dead URLs (a rough sketch of this step follows below)
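For illustration only, here is a heavily simplified sketch of that last step: writing a timestamped report, backing up the CSV, and dropping the dead rows. The file name `apis.csv`, the `API_Name` column usage, and the one-URL-per-cell assumption are placeholders, not taken from `DeadURL.py`.

```python
# Hypothetical sketch, not DeadURL.py's actual code: build a timestamped report
# name, back up the original CSV, then drop rows whose URL was found dead.
import shutil
from datetime import datetime

import pandas as pd

report_name = datetime.now().strftime("scan_results_%Y-%m-%d_%H-%M-%S.txt")

csv_path = "apis.csv"                        # placeholder input file
dead_urls = {"https://example.com/missing"}  # in practice, collected by the scan

with open(report_name, "w") as fh:           # write the scan report
    fh.write("\n".join(sorted(dead_urls)))

shutil.copy(csv_path, csv_path + ".bak")     # backup before modifying anything

df = pd.read_csv(csv_path)
df = df[~df["API_Name"].isin(dead_urls)]     # simplified: one URL per cell
df.to_csv(csv_path, index=False)
```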
Notes:
- The concurrency limit can be adjusted by changing `CONCURRENT_LIMIT` at the top of the script for faster or slower scanning, depending on your hardware/network (see the sketch after this list)
- The User-Agent and headers are set in `STEALTH_HEADERS` — you can update them to mimic different browsers or add custom headers if needed
- The script uses Playwright to simulate browser requests for better accuracy than simple HTTP requests
- URLs are sanitized to accept only those starting with `http`, `https`, or `www.` — this can be tweaked inside `sanitize_url()`
- File cleaning will remove only the dead URLs and save a backup automatically
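As a rough reference, the kind of tweakable constants and sanitizer described above might look like this. The header values and the `www.` handling are assumptions; check `DeadURL.py` for the real ones.

```python
# Illustrative only: the real constants and function live in DeadURL.py.
CONCURRENT_LIMIT = 15  # raise for faster hardware/networks, lower for weaker ones

STEALTH_HEADERS = {
    # Example browser-like headers; swap in whatever browser you want to mimic.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def sanitize_url(raw):
    """Accept only http://, https://, or www. URLs; prefix www. with https://."""
    url = raw.strip()
    if url.startswith(("http://", "https://")):
        return url
    if url.startswith("www."):
        return "https://" + url
    return None  # anything else is ignored
```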
Input file format:
- The file should contain URLs, ideally with a column named `API_Name` (optional)
- URLs can start with `http://`, `https://`, or `www.`
- URLs can be separated by commas, spaces, or new lines inside cells (see the example after this list)
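As a hypothetical example of handling that layout (the file name `apis.csv` and the exact splitting logic are assumptions, not the script's code), cells holding several URLs can be split like this with pandas:

```python
# Hypothetical example: read the file with pandas and split cells that hold
# several URLs separated by commas, spaces, or new lines.
import re

import pandas as pd

df = pd.read_csv("apis.csv")                 # placeholder file name
cells = df["API_Name"].dropna().astype(str)  # the column is optional in practice
urls = [u for cell in cells for u in re.split(r"[,\s]+", cell) if u]
```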
Requirements:
- Python 3.7+
- Playwright
- pandas
- tqdm
- requests
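For reference, a minimal `requirements.txt` covering the packages above (the file step 2 installs from) could be as simple as this; the repo's own file may pin specific versions:

```text
playwright
pandas
tqdm
requests
```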
MIT License
Pull requests and issues are welcome!
This project is still under development, so issues may arise :(