scrape-llm

A small Python scraping + LLM pipeline with strict resource and output caps: a lightweight web scraper that combines text extraction, image sampling, and local LLM reasoning. It runs fully offline with Ollama (Gemma 3, DeepSeek R1 8B, Llama 3 8B, …) or via Hugging Face Inference, and includes a one-click Streamlit GUI.


Key features

| Feature | Notes | Resource cap |
| --- | --- | --- |
| Text extraction | Selectolax + Trafilatura first, LLM fills gaps | HTML trimmed to 6 kB |
| Image sampling | Grabs ≤ 3 images ≥ 256 px, converts to JPEG | ≤ 2 MB each |
| Optional captions | Local vision LLM (Qwen2-VL 2B, LLaVA 7B, …) | 64 tokens / image |
| GUI | Streamlit at http://localhost:8501 | none |
| One-click launcher | start_scraper.cmd installs/updates deps automatically | |
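
To make the first two rows concrete, here is a minimal sketch of that flow under the stated caps. The function names, the requests-based fetching, and the absolute-URL filter are my own assumptions, not the repo's actual implementation:

```python
from io import BytesIO

import requests
import trafilatura
from PIL import Image
from selectolax.parser import HTMLParser

MAX_HTML_BYTES = 6 * 1024          # "HTML trimmed to 6 kB"
MAX_IMAGES = 3                     # "grabs ≤ 3 images"
MIN_SIDE_PX = 256                  # "≥ 256 px"
MAX_JPEG_BYTES = 2 * 1024 * 1024   # "≤ 2 MB each"


def extract_text(html: str) -> str:
    """Selectolax + Trafilatura first; the LLM only fills gaps."""
    text = trafilatura.extract(html)                # main-content extraction
    if text:
        return text
    tree = HTMLParser(html)                         # fallback: plain text dump
    return tree.body.text(separator=" ") if tree.body else ""


def llm_context(html: str) -> str:
    """Whatever raw HTML the gap-filling LLM sees is capped at 6 kB."""
    return html[:MAX_HTML_BYTES]


def sample_images(html: str) -> list[bytes]:
    """Keep up to 3 images ≥ 256 px on the short side, re-encoded as JPEG ≤ 2 MB."""
    jpegs: list[bytes] = []
    for node in HTMLParser(html).css("img"):
        src = node.attributes.get("src")
        if not src or not src.startswith("http"):   # skip relative URLs for brevity
            continue
        try:
            img = Image.open(BytesIO(requests.get(src, timeout=10).content))
        except Exception:
            continue                                # unreachable or non-image src
        if min(img.size) < MIN_SIDE_PX:
            continue
        buf = BytesIO()
        img.convert("RGB").save(buf, "JPEG", quality=85)
        if buf.tell() <= MAX_JPEG_BYTES:
            jpegs.append(buf.getvalue())
        if len(jpegs) == MAX_IMAGES:
            break
    return jpegs
```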

Quick start (Windows / PowerShell)

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -U pip
pip install -r requirements.txt

# Copy .env.example to .env and adjust values
copy .env.example .env
notepad .env

Quick start (Windows 11, after the latest merge)

git clone https://github.com/<YOUR-USER>/scrape-llm.git
cd scrape-llm

REM One-time: create .env from template and pick your model
copy .env.example .env
notepad .env            # set LLM_PROVIDER, LLM_MODEL, optional LLM_VISION_MODEL

REM One click from now on:
.\start_scraper.cmd
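
For reference, a minimal .env for a local Ollama setup might look like the sketch below. The keys match the comment above; the model tags are illustrative picks from the models mentioned in this README, not the template's defaults:

```
LLM_PROVIDER=ollama
LLM_MODEL=llama3:8b
LLM_VISION_MODEL=llava:7b
```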

Next Steps

This is a project I wanted to build because it helps in understanding basic LLM use cases, so I will keep iterating with different models:

  • Vision: llava:7b or qwen2-vl:2b (a caption sketch follows this list)
  • Audio output: whisper-tiny
  • Audio input: Qwen2-Audio 7B
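
As a taste of the vision step, captioning one sampled image against a local Ollama model can be this small. The prompt and helper name are my own; the /api/generate endpoint, the base64 images field, and the num_predict option are standard Ollama API, matching the 64-token cap in the feature table:

```python
import base64

import requests


def caption(jpeg_bytes: bytes, model: str = "llava:7b") -> str:
    """Caption one image with a local vision model, capped at 64 tokens."""
    resp = requests.post(
        "http://localhost:11434/api/generate",   # Ollama's default local endpoint
        json={
            "model": model,
            "prompt": "Describe this image in one short sentence.",
            "images": [base64.b64encode(jpeg_bytes).decode()],
            "stream": False,
            "options": {"num_predict": 64},      # the 64 tokens / image cap
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()
```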

Then try:

.\.venv\Scripts\python.exe main.py https://www.bbc.com/news https://www.reuters.com