GitHub - ari-r-1/data-extraction-and-NLP-text-analysis-: This project automates the extraction and analysis of article content from a list of URLs provided in `Input.xlsx`. It processes the textual content using NLP techniques to compute sentiment, readability, and other linguistic metrics, and exports the final results to a structured Excel file.

🧠 Data Extraction And NLP Text Analysis

📌 Objective

This project automates the extraction and analysis of article content from a list of URLs provided in Input.xlsx. It processes the text and calculates key sentiment and readability metrics. The results are exported to an Excel file, structured as per Output Data Structure.xlsx.
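The extraction step can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `extract_article_text` and `fetch_article`, the choice to keep only the page title and `<p>` elements, and the 10-second timeout are all assumptions.

```python
import requests
from bs4 import BeautifulSoup


def extract_article_text(html: str) -> str:
    """Return the page title and paragraph text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that normally carry no article prose.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    # Skip empty strings so blank paragraphs do not produce blank lines.
    return "\n".join(part for part in [title, *paragraphs] if part)


def fetch_article(url: str, timeout: float = 10.0) -> str:
    """Download one URL and return its extracted text (timeout value is an assumption)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return extract_article_text(response.text)


# Example usage, assuming Input.xlsx has the URL_ID and URL columns described above:
#   import pandas as pd
#   for row in pd.read_excel("Input.xlsx").itertuples():
#       text = fetch_article(row.URL)
```

Real article pages often need a site-specific selector instead of "every `<p>` element", so the paragraph filter is the part most likely to need adjusting.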

🧾 Input Files

Please ensure the following files are available in the same directory before running the script:

Input.xlsx – List of articles with URL_ID and URL
positive-words.txt – Positive words list (from MasterDictionary)
negative-words.txt – Negative words list (from MasterDictionary)
StopWords/ – A folder containing all stopword .txt files such as:
- StopWords_Generic.txt
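Loading these files might look like the sketch below. It assumes one entry per line and `latin-1` encoding, neither of which is confirmed by this README, and the helper names `load_word_list` and `load_stopwords` are hypothetical.

```python
from pathlib import Path


def load_word_list(path, encoding="latin-1"):
    """Read a one-word-per-line file into a lowercase set (file format is an assumption)."""
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip().lower()
        if word:  # ignore blank lines
            words.add(word)
    return words


def load_stopwords(folder):
    """Merge every .txt file in the stopword folder into a single set."""
    stopwords = set()
    for file in sorted(Path(folder).glob("*.txt")):
        stopwords |= load_word_list(file)
    return stopwords


# Example usage with the files listed above:
#   positive = load_word_list("positive-words.txt")
#   negative = load_word_list("negative-words.txt")
#   stopwords = load_stopwords("StopWords")
```

Sets are used because the later analysis only needs fast membership checks, not word order.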

Required core packages

pandas           # For reading/writing Excel and handling data
openpyxl         # For reading/writing .xlsx files with pandas
nltk             # For tokenization and stopword handling
beautifulsoup4   # For parsing HTML content from web pages
requests         # For sending HTTP requests to URLs

Install the packages:

pip install pandas openpyxl nltk beautifulsoup4 requests

Then download the NLTK tokenizer data once from Python:

import nltk
nltk.download('punkt')

---

## 📊 Output

The output Excel file will follow the format defined in Output Data Structure.xlsx, containing:
- Cleaned text
- Sentiment scores (positive, negative, polarity, subjectivity)
- Readability scores (syllables, average sentence length, FOG index)
- Word and sentence counts
- Complex word analysis
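The README does not give the formulas behind these metrics, so the sketch below uses common definitions as assumptions: polarity as (P − N) / (P + N), subjectivity as (P + N) / word count, complex words as those with three or more syllables, and the standard Gunning FOG index, 0.4 × (average sentence length + percentage of complex words). To stay self-contained it tokenizes with regular expressions rather than NLTK and counts syllables with a rough vowel-group heuristic. All function names are hypothetical.

```python
import re


def tokenize_words(text):
    """Lowercase word tokens; a regex stand-in for an NLTK tokenizer."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naive sentence split on ., ! and ? (assumption, not NLTK's punkt)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def count_syllables(word):
    """Approximate syllables as vowel groups, discounting a trailing 'es'/'ed'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)


def analyze(text, positive, negative, stopwords=frozenset()):
    """Compute sentiment on stopword-filtered tokens and readability on all tokens."""
    tokens = tokenize_words(text)
    cleaned = [w for w in tokens if w not in stopwords]
    sentences = split_sentences(text)

    pos = sum(w in positive for w in cleaned)
    neg = sum(w in negative for w in cleaned)
    eps = 1e-6  # avoids division by zero when no sentiment words are found

    complex_count = sum(count_syllables(w) >= 3 for w in tokens)
    avg_sentence_length = len(tokens) / len(sentences) if sentences else 0.0
    pct_complex = 100 * complex_count / len(tokens) if tokens else 0.0

    return {
        "positive_score": pos,
        "negative_score": neg,
        "polarity_score": (pos - neg) / (pos + neg + eps),
        "subjectivity_score": (pos + neg) / (len(cleaned) + eps),
        "word_count": len(cleaned),
        "sentence_count": len(sentences),
        "avg_sentence_length": avg_sentence_length,
        "complex_word_count": complex_count,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
    }


result = analyze(
    "The service was excellent. The delivery was terrible and disappointing.",
    positive={"excellent"},
    negative={"terrible", "disappointing"},
    stopwords={"the", "was", "and"},
)
print(result["positive_score"], result["negative_score"], result["sentence_count"])
```

The vowel-group heuristic overcounts words with a silent final "e" (it gives "service" three syllables), so the complex-word count and FOG index should be read as approximations.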

---

Folders and files:
- stopwords
- Input.xlsx
- Instructions.txt
- Output.xlsx
- README.md
- data_extraction_and_nlp_analyzer.py
- data_extraction_and_nlp_analyzer_colab.ipynb
- data_extraction_and_nlp_tkinder.py
- negative-words.txt
- positive-words.txt
