This project is an automatic web scraper that uses the gpt-oss:20b model, served locally by Ollama, to parse the body content of a web page. The application is built with Streamlit for the user interface, Selenium and BeautifulSoup for web scraping, and LangChain for model integration.

Features:
- 🚀 Advanced Web Scraping: Scrape the body content of any web page with improved error handling
- 🧹 Smart Content Cleaning: Clean the scraped content by removing scripts, styles, and unwanted elements
- 📊 Intelligent Chunking: Split large content into manageable chunks for processing (cleaning and chunking are sketched after this list)
- 🤖 AI-Powered Parsing: Parse content using the powerful Ollama gpt-oss:20b model
- 📈 Real-time Progress: Track scraping and parsing progress with visual indicators
- 💾 Export Results: Download parsed results as text files
- ⚙️ Configurable Settings: Adjust chunk sizes and processing parameters
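The cleaning and chunking steps referenced above are simple to sketch. A minimal version, assuming helper names like clean_body_content and split_dom_content (the actual implementations in scrape.py and utils.py may differ):

```python
from typing import List

from bs4 import BeautifulSoup


def clean_body_content(html: str) -> str:
    """Strip scripts, styles, and markup, keeping readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()  # remove non-content elements entirely
    text = soup.get_text(separator="\n")
    # Drop blank lines and surrounding whitespace
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def split_dom_content(content: str, max_length: int = 6000) -> List[str]:
    """Split cleaned text into fixed-size chunks the model can process."""
    return [content[i:i + max_length] for i in range(0, len(content), max_length)]
```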
What's new in this version:
- ✨ Updated to use the Ollama gpt-oss:20b model for better performance
- 🛡️ Enhanced error handling and logging
- 🎨 Improved user interface with better feedback
- 📱 Responsive design with sidebar configuration
- 🔧 Modular code structure with separate config and utility files
- 📊 Content statistics and processing metrics
- 🌐 Better URL validation and domain extraction (sketched after this list)
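The URL validation and domain extraction mentioned above need nothing beyond the standard library. A minimal sketch (the helpers in utils.py may be named differently):

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def extract_domain(url: str) -> str:
    """Return the host portion of a URL, e.g. 'example.com'."""
    return urlparse(url).netloc
```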
Prerequisites:
- Python 3.8 or higher
- Ollama installed with the gpt-oss:20b model
- Chrome browser (ChromeDriver will be downloaded automatically)
Installation:

```bash
# Install Ollama (macOS)
brew install ollama

# Pull the gpt-oss:20b model
ollama pull gpt-oss:20b
```
Create and activate a virtual environment:

```bash
python -m venv ai
```

- On macOS and Linux: `source ai/bin/activate`
- On Windows: `.\ai\Scripts\activate`
Install the dependencies:

```bash
pip install -r requirements.txt
```
```bash
# Optional: Run the ChromeDriver setup utility to verify compatibility
python setup_chromedriver.py
```
Usage:
- Make sure Ollama is running: `ollama serve`
- Activate the virtual environment (if not already activated):
  - On macOS and Linux: `source ai/bin/activate`
  - On Windows: `.\ai\Scripts\activate`
- Run the Streamlit application: `streamlit run main.py`
- 🌐 Enter URL: Input the URL of the website you want to scrape
- ⚙️ Configure Settings: Adjust chunk size in the sidebar (optional)
- 🤳 Scrape Website: Click "Scrape Website" to extract content
- 👁️ Review Content: View the extracted DOM content in the expander
- 📝 Describe Parsing: Describe what specific information you want to extract
- 🚀 Parse Content: Click "Parse Content" to process with AI
- 📊 View Results: Review the extracted information
- 💾 Download: Save results as a text file (optional)
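For orientation, the steps above map roughly onto a Streamlit flow like the following. This is an illustrative sketch, not the actual main.py; the imported helper names are assumptions based on the project layout:

```python
import streamlit as st

# Hypothetical imports mirroring the project layout
from scrape import scrape_website, clean_body_content, split_dom_content
from parse import parse_with_ollama

st.title("AI Web Scraper")
url = st.text_input("Enter URL")

if st.button("Scrape Website") and url:
    raw_html = scrape_website(url)  # fetch the page with Selenium
    st.session_state["dom_content"] = clean_body_content(raw_html)

if "dom_content" in st.session_state:
    with st.expander("Review Content"):
        st.text_area("Extracted DOM content", st.session_state["dom_content"], height=300)

    description = st.text_area("Describe what to extract")
    if st.button("Parse Content") and description:
        chunks = split_dom_content(st.session_state["dom_content"])
        result = parse_with_ollama(chunks, description)  # AI parsing, chunk by chunk
        st.write(result)
        st.download_button("Download", result, file_name="parsed_results.txt")
```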
- "Extract all email addresses"
- "Find product names and prices"
- "Get all phone numbers and contact information"
- "Extract article titles and publication dates"
- "Find all social media links"
You can modify settings in config.py:
- Model Settings: Change Ollama model, temperature, and prediction limits
- Scraping Settings: Adjust wait times, browser settings, and chunk sizes
- UI Settings: Customize page title, icons, and layout
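A config.py along these lines would expose those settings; the names and defaults below are illustrative rather than the project's actual values:

```python
# config.py -- illustrative names and defaults; check the real file

# Model settings
MODEL_NAME = "gpt-oss:20b"   # Ollama model used for parsing
TEMPERATURE = 0.1            # lower values give more deterministic extraction
NUM_PREDICT = 2048           # cap on tokens generated per chunk

# Scraping settings
PAGE_LOAD_WAIT = 5           # seconds to wait for dynamic content to render
HEADLESS = True              # run Chrome without a visible window
DEFAULT_CHUNK_SIZE = 6000    # characters per chunk sent to the model

# UI settings
PAGE_TITLE = "AI Web Scraper"
PAGE_ICON = "🤖"
LAYOUT = "wide"
```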
Project structure:

```
├── main.py                  # Main Streamlit application
├── scrape.py                # Web scraping functionality
├── parse.py                 # AI parsing with Ollama
├── config.py                # Configuration settings
├── utils.py                 # Utility functions
├── setup_chromedriver.py    # ChromeDriver setup utility
├── requirements.txt         # Python dependencies
└── README.md                # Documentation
```
Dependencies:
- streamlit: Web application framework
- langchain & langchain_ollama: LLM integration
- selenium: Web browser automation
- webdriver-manager: Automatic ChromeDriver management (see the scraping sketch after this list)
- beautifulsoup4: HTML parsing
- lxml & html5lib: XML/HTML processing
- python-dotenv: Environment variable management
- requests & urllib3: HTTP libraries
Troubleshooting:
- ChromeDriver version mismatch: the app now automatically downloads the correct ChromeDriver version. If you still see ChromeDriver errors, run `python setup_chromedriver.py`, which downloads and tests a ChromeDriver compatible with your installed Chrome version.
- Ollama model not available: run `ollama pull gpt-oss:20b`
- Connection errors: Check internet connection and URL validity
- Memory issues: Reduce chunk size in sidebar settings
The project now includes automatic ChromeDriver management using webdriver-manager. If you encounter ChromeDriver compatibility issues:
```bash
# Run the ChromeDriver setup utility
python setup_chromedriver.py
```
This utility will:
- ✅ Detect your Chrome browser version
- 📥 Download the compatible ChromeDriver automatically
- 🧪 Test the ChromeDriver to ensure it works
- 📋 Provide detailed status information
Performance tips:
- Use smaller chunk sizes for faster processing
- Enable headless browsing for better performance
- Close unnecessary browser tabs to free memory
- The ChromeDriver is automatically cached for faster subsequent runs
This project is licensed under the MIT License. See the LICENSE file for more details.