
Hybrid web crawler that intelligently renders JavaScript-heavy pages only when needed — balances content coverage and infrastructure cost using heuristic scoring + conditional Playwright rendering.


SaiSandeepKantareddy/hybrid_crawler_js


hybrid-crawler-js

A scalable web crawler that intelligently decides when to render pages with JavaScript — saving infrastructure cost while maximizing content coverage.

Inspired by Andrew Chan’s billion-page crawler, this project adds a conditional JS rendering layer, driven by metadata heuristics like DOM sparsity, script count, and text length.

Project Structure


hybrid_crawler_js/
├── crawler/
│   ├── base_fetcher.py       # Fast async HTML fetcher
│   ├── js_renderer.py        # Playwright-based JS renderer
│   └── scorer.py             # Heuristic scoring function
├── pipeline/
│   ├── hybrid_runner.py      # Main orchestrator
│   └── config.yaml           # Config: input file + JS score threshold
├── utils/
│   └── metrics.py            # Coverage, cost, quality metrics
├── tests/
│   └── test_end_to_end.py    # Test runner
├── data/
│   ├── urls_sample.txt       # Sample URLs to crawl
│   └── hybrid_output.csv     # Output: content, scores, flags
└── README.md

Installation

git clone https://github.com/SaiSandeepKantareddy/hybrid_crawler_js
cd hybrid_crawler_js

# Set up Python environment
pip install -r requirements.txt

# Install Playwright and browser engine
playwright install

Usage

Step 1: Add URLs to crawl

Edit data/urls_sample.txt:

https://www.airbnb.com
https://twitter.com
https://www.linkedin.com/jobs

Step 2: Adjust the config

pipeline/config.yaml:

urls_file: data/urls_sample.txt
js_threshold: 0.7

Set js_threshold to 0.0 to force JS rendering for all URLs (for testing).
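The threshold rule can be sketched as a single comparison. This is a minimal illustration, not the project's actual code: the function name `should_render_js` is hypothetical, and the "score at or above threshold triggers rendering" direction is an assumption inferred from the note above (a threshold of 0.0 forces rendering for every URL).

```python
def should_render_js(score: float, js_threshold: float) -> bool:
    """Return True when the heuristic score meets or exceeds the threshold.

    Assumption: higher scores mean "more likely to need JavaScript".
    """
    return score >= js_threshold


if __name__ == "__main__":
    # With the default threshold of 0.7, only high-scoring pages are rendered.
    for score in (0.0, 0.7, 1.0):
        print(score, should_render_js(score, js_threshold=0.7))
```

Under this rule, lowering `js_threshold` sends more pages to Playwright (higher coverage, higher cost), while raising it keeps more pages on the fast static path.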

Step 3: Run the hybrid crawler

python -m pipeline.hybrid_runner

This will:

  • Fetch pages using aiohttp
  • Score each page's content
  • Render with Playwright only if needed
  • Save results to data/hybrid_output.csv
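The steps above can be sketched as a small asyncio pipeline. This is a hedged illustration only: `CrawlResult`, `crawl_one`, `crawl_all`, and the `fake_*` functions are hypothetical names, and the stand-ins replace the real aiohttp fetch and Playwright render so the sketch runs without network access. Falling back to the static HTML when rendering fails is also an assumption, chosen because the sample output shows rows with an error and `used_js` set to False.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CrawlResult:
    url: str
    used_js: bool
    score: float
    error: str
    content: str


async def crawl_one(url, fetch, score, render, js_threshold):
    """Fetch statically, score the HTML, and render only if the score is high."""
    html = await fetch(url)
    page_score = score(html)
    if page_score < js_threshold:
        return CrawlResult(url, False, page_score, "", html)
    try:
        rendered = await render(url)
    except Exception as exc:
        # Assumed behavior: keep the static HTML and record the error.
        return CrawlResult(url, False, page_score, f"render error: {exc}", html)
    return CrawlResult(url, True, page_score, "", rendered)


# Stand-ins for the real fetcher, scorer, and renderer (hypothetical).
async def fake_fetch(url):
    return "<div id='root'></div>" if "app" in url else "<p>static text</p>"


def fake_score(html):
    return 1.0 if "root" in html else 0.0


async def fake_render(url):
    if "broken" in url:
        raise RuntimeError("headers too large")
    return "<p>rendered</p>"


async def crawl_all(urls, js_threshold=0.7):
    """Run the fetch, score, decide, render steps for all URLs concurrently."""
    tasks = [
        crawl_one(u, fake_fetch, fake_score, fake_render, js_threshold)
        for u in urls
    ]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    demo_urls = [
        "https://static.example",
        "https://app.example",
        "https://app.example/broken",
    ]
    for result in asyncio.run(crawl_all(demo_urls)):
        print(result.url, result.used_js, result.score, result.error)
```

Passing the fetcher, scorer, and renderer in as arguments keeps the decision logic testable without a browser; the real modules would be swapped in where the stand-ins appear.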

Example Output

data/hybrid_output.csv:

URL                         used_js   score   error
https://airbnb.com          True      0.7
https://twitter.com         False     1.0     Playwright error (headers too large)
https://linkedin.com/jobs   False     0.0

Metrics

You can summarize the results using:

python utils/metrics.py

Sample Output:

Summary Metrics:
        total_pages: 3
   js_rendered_pages: 1
      js_rendered_%: 33.3
           avg_score: 0.566
       error_rate_%: 33.3
 avg_content_length: 18390
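A summary like the one above could be computed along these lines. This is a sketch under assumptions, not the contents of `utils/metrics.py`: the function name `summarize` is hypothetical, the CSV is assumed to have `URL`, `used_js`, `score`, and `error` columns as in the example output, and `avg_content_length` is omitted because the example rows do not show a content column.

```python
import csv
import io


def summarize(rows):
    """Compute summary metrics from rows shaped like the example output."""
    total = len(rows)
    if total == 0:
        return {}
    js_pages = sum(1 for r in rows if r["used_js"] == "True")
    errors = sum(1 for r in rows if (r["error"] or "").strip())
    avg_score = sum(float(r["score"]) for r in rows) / total
    return {
        "total_pages": total,
        "js_rendered_pages": js_pages,
        "js_rendered_%": round(100 * js_pages / total, 1),
        "avg_score": round(avg_score, 3),
        "error_rate_%": round(100 * errors / total, 1),
    }


# Inline copy of the example rows, so the sketch runs without a file on disk.
SAMPLE_CSV = """URL,used_js,score,error
https://airbnb.com,True,0.7,
https://twitter.com,False,1.0,Playwright error (headers too large)
https://linkedin.com/jobs,False,0.0,
"""

if __name__ == "__main__":
    sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    for key, value in summarize(sample_rows).items():
        print(f"{key}: {value}")
```

This version rounds the average score, so its last digit can differ from the sample output shown above.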

How It Works

  • base_fetcher.py: Async HTTP fetch for fast crawling
  • scorer.py: Applies heuristic rules (text length, DOM density, noscript tags, etc.) to estimate whether JS is needed
  • js_renderer.py: Uses Playwright to render JS-heavy pages on demand
  • hybrid_runner.py: Orchestrates fetch → score → decide → render pipeline
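To make the scoring idea concrete, here is a minimal heuristic built on the standard library's `html.parser`. It is an illustration only: the name `js_need_score`, the cutoffs (200 characters, 5 scripts), and the weights (0.5, 0.3, 0.2) are all assumptions, and it covers just three of the signals mentioned (text length, script count, noscript tags), not DOM density.

```python
from html.parser import HTMLParser

_SKIPPED_TAGS = ("script", "style", "noscript")


class _PageStats(HTMLParser):
    """Collect a few cheap signals from static HTML."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0       # visible text outside script/style/noscript
        self.script_count = 0
        self.has_noscript = False
        self._skip_depth = 0      # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag == "noscript":
            self.has_noscript = True
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def js_need_score(html: str) -> float:
    """Return a score in [0, 1]; higher means JS rendering is more likely needed."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    score = 0.0
    if stats.text_chars < 200:     # hypothetical cutoff: very little visible text
        score += 0.5
    if stats.script_count >= 5:    # hypothetical cutoff: script-heavy page
        score += 0.3
    if stats.has_noscript:         # page ships a no-JS fallback message
        score += 0.2
    return min(1.0, round(score, 2))


if __name__ == "__main__":
    shell = "<div id='root'></div><noscript>Enable JS</noscript>"
    print(js_need_score(shell))
```

A page that is an empty application shell scores high under these rules, while a text-heavy static page scores 0.0 and stays on the fast path.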

Testing

Run a full end-to-end test with:

python -m tests.test_end_to_end

The test verifies output file creation, scoring, and JS routing behavior.

Next Features

  • Screenshot capture (.png)
  • Parallel Playwright rendering pool
  • robots.txt support
  • Configurable scoring weights
  • Lightweight JS-needed classifier (coming)

Disclaimer

Sample URLs included in this project (e.g., Airbnb, Twitter, LinkedIn) are for educational and testing purposes only.

This project:

  • Does not store, republish, or distribute any third-party content.
  • Does not perform large-scale scraping.
  • Is intended to demonstrate technical architecture for hybrid crawling only.

Always consult a website's Terms of Service and robots.txt before crawling at scale.

Acknowledgements

Built as an open-source proof-of-concept following the brilliant crawler performance post by Andrew Chan.

Questions or Suggestions?

Open an issue or drop a message.
