Detik.com News Scraping
- Python 3.10.14
- Installed packages (use requirements.txt)
- Currently, this engine supports only the detik.com website
- Scrapes news by date range and search keywords
- Uses an asynchronous flow for faster scraping
- Collects a news dataset, including metadata such as title, publication date, keywords, and authors
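To illustrate what an asynchronous flow can look like, here is a minimal sketch using only the standard library. It is not the engine's actual code: `fetch_page`, `fetch_all`, and `MAX_CONCURRENCY` are hypothetical names, and `fetch_page` only simulates a network call rather than requesting detik.com.

```python
import asyncio

MAX_CONCURRENCY = 5  # hypothetical limit on simultaneous requests


async def fetch_page(url: str) -> str:
    """Placeholder for a real HTTP request; sleeps to simulate network latency."""
    await asyncio.sleep(0.01)
    return f"<html>content of {url}</html>"


async def fetch_all(urls: list[str], limit: int = MAX_CONCURRENCY) -> list[str]:
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded_fetch(url: str) -> str:
        # The semaphore keeps the scraper from flooding the target site.
        async with semaphore:
            return await fetch_page(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))
```

Calling `asyncio.run(fetch_all(urls))` runs the whole batch. Because the waits overlap, total time grows with the number of batches of `limit` requests rather than with the number of pages.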
- Clone repository
git clone <repository-url>
cd <repository-directory>
- Create and activate a conda environment
conda create -n venvname python=3.10.14
conda activate venvname
- Prepare your environment by installing the requirements.
pip install -r requirements.txt
- Run the scraper from a terminal with the environment activated:
- Command:
python run_scraping.py
- Examples (multiple keywords are comma-separated):
python run_scraping.py --from_date="29/08/2024" --to_date="30/08/2024" --keyword=pangan
python run_scraping.py --from_date="29/08/2024" --to_date="30/08/2024" --keyword pangan,BBM,demo
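The flags in the examples above could be parsed along the following lines. This is a sketch, not the contents of `run_scraping.py`: the helper names (`parse_date`, `parse_keywords`, `parse_args`) are hypothetical, and making all three flags required is an assumption, as is rejecting a `--from_date` later than `--to_date`.

```python
import argparse
from datetime import date, datetime

DATE_FORMAT = "%d/%m/%Y"  # matches the dd/mm/yyyy values used in the examples


def parse_date(value: str) -> date:
    """Convert a dd/mm/yyyy string into a date, with a readable error on failure."""
    try:
        return datetime.strptime(value, DATE_FORMAT).date()
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"expected dd/mm/yyyy, got {value!r}") from exc


def parse_keywords(value: str) -> list[str]:
    """Split a comma-separated keyword string, dropping empty entries."""
    return [k.strip() for k in value.split(",") if k.strip()]


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Scrape news by date range and keywords")
    # Assumption: all three flags are required in this sketch.
    parser.add_argument("--from_date", type=parse_date, required=True)
    parser.add_argument("--to_date", type=parse_date, required=True)
    parser.add_argument("--keyword", type=parse_keywords, required=True)
    args = parser.parse_args(argv)
    if args.from_date > args.to_date:
        parser.error("--from_date must not be after --to_date")
    return args
```

With argparse, `--keyword=pangan` and `--keyword pangan` are equivalent, which is consistent with the two example commands using different forms.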
- The output will be exported into a Parquet file.
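Once a run has finished, the Parquet file can be loaded for inspection, for example with pandas. This is a minimal sketch: `load_news` is a hypothetical helper, the output file name is not specified here, and pandas together with a Parquet engine such as pyarrow is assumed to be installed (check requirements.txt).

```python
from pathlib import Path


def load_news(parquet_path):
    """Load a scraped Parquet file into a pandas DataFrame."""
    path = Path(parquet_path)
    if not path.is_file():
        raise FileNotFoundError(f"No Parquet file at {path}")
    # Imported here so the path check works even before pandas is installed.
    import pandas as pd  # also needs a Parquet engine such as pyarrow

    return pd.read_parquet(path)
```

For example, `df = load_news("path/to/output.parquet")` followed by `print(df.head())` shows the first rows. Replace the placeholder path with the file your run produced.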