
Automatic Web Scraper

This project is an automatic web scraper that uses the gpt-oss:20b model, served locally through Ollama, to parse the body content of a web page. The application is built with Streamlit for the user interface and a set of Python libraries for web scraping and parsing.

Features

  • 🚀 Advanced Web Scraping: Scrape the body content of any web page with improved error handling
  • 🧹 Smart Content Cleaning: Clean the scraped content by removing scripts, styles, and unwanted elements
  • 📊 Intelligent Chunking: Split large content into manageable chunks for processing (a sketch of the cleaning and chunking steps follows this list)
  • 🤖 AI-Powered Parsing: Parse content using the powerful Ollama gpt-oss:20b model
  • 📈 Real-time Progress: Track scraping and parsing progress with visual indicators
  • 💾 Export Results: Download parsed results as text files
  • ⚙️ Configurable Settings: Adjust chunk sizes and processing parameters
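
For illustration, here is a minimal sketch of how the cleaning and chunking steps might look with BeautifulSoup. The helper names clean_body_content and split_dom_content are assumptions for this sketch; the real implementations live in scrape.py and may differ:

# Minimal sketch of content cleaning and chunking (hypothetical helper
# names; the real implementations live in scrape.py and may differ).
from typing import List

from bs4 import BeautifulSoup

def clean_body_content(html: str) -> str:
    """Strip scripts/styles and return readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the element and its contents entirely
    # Collapse whitespace and drop empty lines
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)

def split_dom_content(content: str, max_length: int = 6000) -> List[str]:
    """Split cleaned text into chunks the model can handle."""
    return [content[i:i + max_length] for i in range(0, len(content), max_length)]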

New in This Version

  • ✨ Updated to use the Ollama gpt-oss:20b model for better performance
  • 🛡️ Enhanced error handling and logging
  • 🎨 Improved user interface with better feedback
  • 📱 Responsive design with sidebar configuration
  • 🔧 Modular code structure with separate config and utility files
  • 📊 Content statistics and processing metrics
  • 🌐 Better URL validation and domain extraction
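
As a rough illustration of the URL validation and domain extraction mentioned above, a sketch using Python's standard urllib.parse (the real helpers in utils.py may be named and implemented differently):

# Hypothetical sketch of URL validation and domain extraction; the real
# helpers in utils.py may differ.
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def extract_domain(url: str) -> str:
    return urlparse(url).netloc

print(is_valid_url("https://example.com/page"))   # True
print(extract_domain("https://example.com/page")) # example.com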

Demo

(Demo screenshot of the web scraper in action)

Installation

Prerequisites

  • Python 3.8 or higher
  • Ollama installed with the gpt-oss:20b model
  • Chrome browser (ChromeDriver will be downloaded automatically)

Install Ollama and the model:

# Install Ollama (macOS)
brew install ollama

# Pull the gpt-oss:20b model
ollama pull gpt-oss:20b

Create a virtual environment:

python -m venv ai

Activate the virtual environment:

  • On macOS and Linux:
source ai/bin/activate
  • On Windows:
.\ai\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set up ChromeDriver (automatic):

# Optional: Run ChromeDriver setup utility to verify compatibility
python setup_chromedriver.py

Running the Application

  1. Make sure Ollama is running:
ollama serve
  2. Activate the virtual environment (if not already activated):
  • On macOS and Linux:
source ai/bin/activate
  • On Windows:
.\ai\Scripts\activate
  3. Run the Streamlit application:
streamlit run main.py

Usage

  1. 🌐 Enter URL: Input the URL of the website you want to scrape
  2. ⚙️ Configure Settings: Adjust chunk size in the sidebar (optional)
  3. 🤳 Scrape Website: Click "Scrape Website" to extract content
  4. 👁️ Review Content: View the extracted DOM content in the expander
  5. 📝 Describe Parsing: Describe what specific information you want to extract
  6. 🚀 Parse Content: Click "Parse Content" to process with AI
  7. 📊 View Results: Review the extracted information
  8. 💾 Download: Save results as a text file (optional)

Examples of Parse Descriptions

  • "Extract all email addresses"
  • "Find product names and prices"
  • "Get all phone numbers and contact information"
  • "Extract article titles and publication dates"
  • "Find all social media links"

Configuration

You can modify settings in config.py:

  • Model Settings: Change Ollama model, temperature, and prediction limits
  • Scraping Settings: Adjust wait times, browser settings, and chunk sizes
  • UI Settings: Customize page title, icons, and layout
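
The values below sketch what such a config.py commonly contains; the setting names are hypothetical, so check the file itself for the exact names:

# Illustrative config.py values (hypothetical names; consult the actual
# file for the real settings).
MODEL_NAME = "gpt-oss:20b"    # Ollama model used for parsing
TEMPERATURE = 0.1             # lower = more deterministic extraction
NUM_PREDICT = 2048            # cap on tokens generated per chunk
PAGE_LOAD_WAIT = 5            # seconds to wait for dynamic content
HEADLESS = True               # run Chrome without a visible window
DEFAULT_CHUNK_SIZE = 6000     # characters per chunk sent to the model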

Project Structure

├── main.py                # Main Streamlit application
├── scrape.py             # Web scraping functionality
├── parse.py              # AI parsing with Ollama
├── config.py             # Configuration settings
├── utils.py              # Utility functions
├── setup_chromedriver.py # ChromeDriver setup utility
├── requirements.txt      # Python dependencies
└── README.md            # Documentation

Dependencies

  • streamlit: Web application framework
  • langchain & langchain_ollama: LLM integration
  • selenium: Web browser automation
  • webdriver-manager: Automatic ChromeDriver management
  • beautifulsoup4: HTML parsing
  • lxml & html5lib: XML/HTML processing
  • python-dotenv: Environment variable management
  • requests & urllib3: HTTP libraries

Troubleshooting

Common Issues:

  1. ChromeDriver version mismatch: The app now automatically downloads the correct ChromeDriver version
    • If you get ChromeDriver errors, run: python setup_chromedriver.py
    • This will download and test the compatible ChromeDriver for your Chrome version
  2. Ollama model not available: Run ollama pull gpt-oss:20b
  3. Connection errors: Check internet connection and URL validity
  4. Memory issues: Reduce chunk size in sidebar settings

ChromeDriver Setup:

The project now includes automatic ChromeDriver management using webdriver-manager. If you encounter ChromeDriver compatibility issues:

# Run the ChromeDriver setup utility
python setup_chromedriver.py

This utility will:

  • ✅ Detect your Chrome browser version
  • 📥 Download the compatible ChromeDriver automatically
  • 🧪 Test the ChromeDriver to ensure it works
  • 📋 Provide detailed status information
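
In code, automatic driver management with webdriver-manager looks roughly like the standard Selenium 4 pattern below (a sketch, not necessarily the exact code in scrape.py):

# Standard webdriver-manager pattern for Selenium 4; scrape.py may differ.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Downloads (and caches) a ChromeDriver matching the installed Chrome,
# then starts a browser session with it.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")
print(driver.title)
driver.quit()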

Performance Tips:

  • Use smaller chunk sizes for faster processing
  • Enable headless browsing for better performance (see the sketch after this list)
  • Close unnecessary browser tabs to free memory
  • The ChromeDriver is automatically cached for faster subsequent runs
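
For example, headless browsing is enabled through Chrome options (a sketch; the project may already set this in config.py or scrape.py):

# Sketch of enabling headless Chrome; the project may configure this elsewhere.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without a visible window
options.add_argument("--disable-gpu")   # commonly paired with headless mode
driver = webdriver.Chrome(options=options)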

License

This project is licensed under the MIT License. See the LICENSE file for more details.
