This repository contains practical examples of collecting website links using Python and Node.js. It covers several approaches, from basic sitemap parsing with requests to crawling entire websites and scraping Google SERPs with HasData’s API.
Requirements:
Python 3.10+ or Node.js 18+

Required packages (Python):
- requests

Install:
pip install requests

Required packages (Node.js):
- axios

Install:
npm install axios

web-scraping-examples/
│
├── python/
│   ├── sitemap_scraper_requests.py
│   ├── sitemap_scraper_hasdata.py
│   ├── crawler_hasdata.py
│   ├── crawler_ai_hasdata.py
│   ├── google_serp_scraper_hasdata.py
│
├── nodejs/
│   ├── sitemap_scraper_requests.js
│   ├── sitemap_scraper_hasdata.js
│   ├── crawler_hasdata.js
│   ├── crawler_ai_hasdata.js
│   ├── google_serp_scraper_hasdata.js
│
└── README.md
Each script is focused on a specific use case. No frameworks. Just clean and minimal examples to get things done.
Read the full article about scraping URLs from any website.
sitemap_scraper_requests: A basic script that fetches and parses a sitemap XML using requests and xml.etree.ElementTree. No external services involved. Good for simple sites with clean sitemaps.
Change this data:
| Parameter | Description | Example | 
|---|---|---|
| sitemap_url | URL of the sitemap to scrape | 'https://demo.nopcommerce.com/sitemap.xml' | 
| output_file | File name to save links | 'sitemap_links.txt' | 
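For reference, here is a minimal sketch of that approach. The variable names follow the table above; sitemap files use the standard sitemaps.org XML namespace:

```python
import requests
import xml.etree.ElementTree as ET

sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"
output_file = "sitemap_links.txt"

# Fetch the sitemap XML
response = requests.get(sitemap_url, timeout=30)
response.raise_for_status()

# Parse it; sitemaps use the standard sitemaps.org namespace
root = ET.fromstring(response.content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
links = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]

# Save one URL per line
with open(output_file, "w", encoding="utf-8") as f:
    f.write("\n".join(links))

print(f"Saved {len(links)} links to {output_file}")
```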
sitemap_scraper_hasdata: Uses HasData's API to process a sitemap and extract links. Easier to scale, and it works even if the sitemap is large or spread across multiple files.
Change this data:
| Parameter | Description | Example | 
|---|---|---|
| API_KEY | Your HasData API key | '111-1111-11-1' | 
| sitemapUrl | URL of the sitemap to scrape | 'https://demo.nopcommerce.com/sitemap.xml' | 
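A rough Python sketch of the idea follows. The endpoint path, request headers, and response fields here are assumptions made for illustration only; the actual script and HasData's API docs define the real ones:

```python
import requests
import xml.etree.ElementTree as ET

API_KEY = "111-1111-11-1"          # your HasData API key
sitemapUrl = "https://demo.nopcommerce.com/sitemap.xml"

# NOTE: the endpoint path, header name, and response fields below are
# assumptions for this sketch; check the script / HasData docs for the real ones.
resp = requests.post(
    "https://api.hasdata.com/scrape/web",            # assumed endpoint
    headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
    json={"url": sitemapUrl},
    timeout=60,
)
resp.raise_for_status()

# Assuming the raw sitemap XML comes back in a "content" field
xml_content = resp.json().get("content", "")
root = ET.fromstring(xml_content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
links = [loc.text for loc in root.findall(".//sm:loc", ns) if loc.text]

print(f"Found {len(links)} links")
```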
crawler_hasdata: Launches a full crawl of a website using HasData’s crawler. Useful when the sitemap is missing or incomplete. Returns all discovered URLs.
Change this data:
| Parameter | Description | Example | 
|---|---|---|
| API_KEY | Your HasData API key | '111-1111-11-1' | 
| payload.limit | Max number of links to collect | 20 | 
| payload.urls | List of URLs to crawl | ['https://demo.nopcommerce.com'] | 
| output_path | Filename to save the collected URLs | 'results_<job_id>.json' | 
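Conceptually, the flow is: create a crawl job, poll until it finishes, then save the result. The sketch below illustrates this in Python; the endpoint paths and response fields are assumptions, so refer to the script for the real API calls:

```python
import json
import time
import requests

API_KEY = "111-1111-11-1"
payload = {
    "urls": ["https://demo.nopcommerce.com"],  # where crawling starts
    "limit": 20,                               # max number of links to collect
}
headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# NOTE: endpoint paths and response fields are assumptions made for this
# sketch; the actual script / HasData docs define the real ones.
job = requests.post("https://api.hasdata.com/scrapers/crawler/jobs",
                    headers=headers, json=payload, timeout=60).json()
job_id = job["id"]                             # assumed response field

# Poll until the crawl finishes
while True:
    status = requests.get(f"https://api.hasdata.com/scrapers/crawler/jobs/{job_id}",
                          headers=headers, timeout=60).json()
    if status.get("status") in ("finished", "failed"):
        break
    time.sleep(5)

# Save all collected data, including the discovered URLs
output_path = f"results_{job_id}.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(status, f, indent=2)
```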
crawler_ai_hasdata: Same as above, but adds AI-powered content extraction. You define what kind of data you want from each page using aiExtractRules. Great for structured scraping.
Change this data:
| Parameter | Description | Example | 
|---|---|---|
| API_KEY | Your HasData API key | '111-1111-11-1' | 
| urls | List of URLs to crawl | ["https://example.com"] | 
| limit | Max number of pages to crawl | 20 | 
| aiExtractRules | JSON schema for AI content parsing | See script | 
| outputFormat | Desired output format(s) | ["json", "text"] | 
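The payload is the main difference from the plain crawler above. A hypothetical example follows; the exact aiExtractRules schema shown here is an assumption, so see the script for the real one:

```python
# Same job flow as the plain crawler sketch above; only the payload changes.
# The aiExtractRules shape below is an illustrative assumption -- the script
# defines the exact schema HasData expects.
payload = {
    "urls": ["https://example.com"],
    "limit": 20,
    "outputFormat": ["json", "text"],
    "aiExtractRules": {
        "title":       {"type": "string", "description": "page title"},
        "price":       {"type": "number", "description": "product price, if any"},
        "description": {"type": "string", "description": "short page summary"},
    },
}
```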
google_serp_scraper_hasdata: Sends a search query to HasData and gets back links from Google search results. No browser automation needed. A simple and fast way to collect SERP data.
Change this data:
| Parameter | Description | Example | 
|---|---|---|
| api_key | Your HasData API key | 'YOUR-API-KEY' | 
| query | Search query for Google | 'site:hasdata.com inurl:blog' | 
| location | Search location | 'Austin,Texas,United States' | 
| deviceType | Device type for search | 'desktop' | 
| num_results | Number of results to fetch | 100 | 
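A hedged Python sketch of the request is below; the endpoint path, query-parameter names, and response layout are assumptions for illustration, not the script's exact API calls:

```python
import requests

api_key = "YOUR-API-KEY"
query = "site:hasdata.com inurl:blog"
location = "Austin,Texas,United States"
deviceType = "desktop"
num_results = 100

# NOTE: the endpoint path, parameter names, and response layout are
# assumptions for this sketch; the script / HasData docs have the real ones.
resp = requests.get(
    "https://api.hasdata.com/scrape/google/serp",    # assumed endpoint
    headers={"x-api-key": api_key},
    params={"q": query, "location": location,
            "deviceType": deviceType, "num": num_results},
    timeout=60,
)
resp.raise_for_status()

# Assuming organic results live under an "organicResults" key
links = [r.get("link") for r in resp.json().get("organicResults", [])]
print("\n".join(filter(None, links)))
```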
