🛍️ Big Data project analyzing Amazon tech reviews using Databricks, PySpark, and Transformers. Includes sentiment/topic analysis, clustering, and graph analysis (PageRank). Developed for the BDA course at NOVA IMS.

Silvestre17/BDA_AmazonReviews_DatabricksPySparkAnalysis_MasterProject

🛍️ Modelling Amazon Tech Reviews for Consumer Insight 💻

📝 Description

This project analyses consumer purchasing behaviour by exploring user reviews of Computers products in Amazon's Electronics category, written between 2022 and 2023.

The dataset, sourced from McAuley Lab's Amazon Reviews 2023 collection, includes over 571M reviews. Leveraging the Databricks environment with Apache Spark as the Big Data processing engine and Transformer-based models, we performed sentiment and topic analysis, clustering of product reviews, and graph analysis to understand the competitiveness between products within this e-commerce sector and analyse consumer behaviour.

Throughout the project, we relied on PySpark as our core API to interact with Spark components such as Spark SQL, MLlib, and Spark Streaming, together with the GraphFrames package for graph processing.

✨ Objective

The primary objectives of this project are to:

  • Extract and process data from Amazon Reviews '23, specifically targeting Computer Products in the Electronics category.
  • Conduct sentiment analysis utilizing pre-trained Transformer-based models (TX-RoBERTa and mDeBERTa) to classify the sentiment expressed in user reviews.
  • Identify key trends by uncovering trending topics and themes within computer product reviews to understand consumer preferences and concerns.
  • Perform clustering analysis on product reviews to identify distinct segments.
  • Conduct graph analysis (PageRank, Label Propagation) to understand user-product interactions, influential nodes, and communities.
  • Simulate and process streaming data using Spark Streaming.
  • Synthesize the findings into actionable insights for businesses and stakeholders.

📚 Context

This project was developed for the Big Data Analytics course in the Master's in Data Science and Advanced Analytics program at NOVA IMS, during the 2nd Semester of the 2024-2025 academic year.

Datasets: Amazon Reviews 2023 Dataset, Filtered Dataset

πŸ—οΈ Project Workflow & Technologies

The project followed the CRISP-DM methodology, executed within a distributed computing environment.

Figure 1: Project Flowchart.

  1. Business & Data Understanding (Notebook 0_DataCollection & 1_BU&EDA): 💡

    • Problem Definition: Analyze consumer purchasing behaviour and product competitiveness in the Amazon Electronics (Computers) sector using reviews from 2022-2023.
    • Data Source: Amazon Reviews 2023 dataset (Electronics category).
    • Initial Exploration: Understand dataset attributes, size, and structure. Identify key data fields for reviews and product metadata.

    Technologies: Python, Databricks, Apache Spark, PySpark

  2. Data Collection & Preprocessing (Notebook 0_DataCollection & 1_BU&EDA): ⚙️

    • Obtain and load the large dataset into the Databricks environment.
    • Clean and preprocess the data using PySpark: handle missing values, convert data types (e.g., timestamp), filter relevant data (reviews from 2022-2023, 'Computers' category).
    • Feature engineering: e.g., creating a combined text field from review title and body.

    Technologies: PySpark, Spark SQL, Pandas, Matplotlib, Seaborn, MLlib, WordCloud
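
The preprocessing steps above can be sketched in plain Python to show the logic on a handful of records. This is an illustration only, not the project's PySpark code: the field names (`timestamp` in milliseconds, `title`, `text`) are assumed from the Amazon Reviews 2023 schema, and `preprocess` and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def to_ms(year, month, day):
    """Helper for the sample data: a UTC date as a Unix timestamp in milliseconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


def preprocess(reviews):
    """Keep reviews from 2022-2023 and add a combined title + body text field."""
    cleaned = []
    for review in reviews:
        ts_ms = review.get("timestamp")
        if ts_ms is None:
            continue  # drop rows with a missing timestamp
        when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        if when.year not in (2022, 2023):
            continue  # outside the analysis window
        title = (review.get("title") or "").strip()
        body = (review.get("text") or "").strip()
        full_text = " ".join(part for part in (title, body) if part)
        if not full_text:
            continue  # nothing to analyse
        cleaned.append(
            {**review, "review_date": when.date().isoformat(), "full_text": full_text}
        )
    return cleaned


# Hypothetical sample records
sample_reviews = [
    {"title": "Great laptop", "text": "Fast and quiet", "timestamp": to_ms(2022, 6, 15)},
    {"title": "Old review", "text": "From 2019", "timestamp": to_ms(2019, 1, 1)},
    {"title": None, "text": "", "timestamp": to_ms(2023, 3, 1)},
    {"title": "No date", "text": "Missing timestamp"},
]
cleaned_reviews = preprocess(sample_reviews)
```

In Spark the same filters would be expressed as column operations on a DataFrame, so they run in parallel across partitions instead of in a Python loop.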

  3. Sentiment & Topic Analysis (Notebook 2_TextAnalysis): 🗣️

    • Apply pre-trained Transformer models (TX-RoBERTa and mDeBERTa) for sentiment classification (Positive, Neutral, Negative) on review text.
    • Combine outputs from both models to create a robust sentiment score.
    • Perform topic analysis (e.g., using mDeBERTa zero-shot classification) on product titles to identify product categories.
    • Analyze sentiment distribution across different topics/products.

    Technologies: Hugging Face Transformers, TX-RoBERTa, mDeBERTa
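
One simple way to combine two models' outputs into a single sentiment score is to average their class probabilities. The sketch below assumes that approach; the README does not state how the project actually combined the models, and `combine_sentiment`, the equal weighting, and the example probabilities are all hypothetical.

```python
LABELS = ("negative", "neutral", "positive")


def combine_sentiment(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' class probabilities.

    Returns the winning label and a signed score in [-1, 1],
    defined here as P(positive) - P(negative).
    """
    combined = {
        label: weight_a * probs_a[label] + (1 - weight_a) * probs_b[label]
        for label in LABELS
    }
    best_label = max(LABELS, key=lambda label: combined[label])
    score = combined["positive"] - combined["negative"]
    return best_label, score


# Hypothetical probabilities from two models for the same review
model_a = {"negative": 0.1, "neutral": 0.2, "positive": 0.7}
model_b = {"negative": 0.2, "neutral": 0.5, "positive": 0.3}
label, score = combine_sentiment(model_a, model_b)
```

Averaging probabilities rather than voting on hard labels keeps the information about how confident each model was, which matters when the two models disagree.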

  4. Streaming Simulation (Notebook 3.1_ModellingStreamingSimulation): 🌊

    • Simulate a streaming scenario by processing review data in chunks.
    • Apply sentiment analysis models to incoming data streams using Spark Streaming.
    • Demonstrate capabilities for near real-time processing and analysis.

    Technologies: Spark Streaming, Databricks
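
The idea of processing reviews in chunks and updating results as each chunk arrives can be illustrated without Spark. The following plain-Python sketch is not the project's Spark Streaming code: `micro_batches`, `run_stream`, and the keyword-based `toy_sentiment` (a stand-in for the Transformer models) are hypothetical names used for illustration.

```python
from collections import Counter
from itertools import islice

POSITIVE_WORDS = {"great", "love"}
NEGATIVE_WORDS = {"bad", "broken"}


def toy_sentiment(text):
    """Keyword rule standing in for the real sentiment models."""
    words = set(text.lower().split())
    if words & POSITIVE_WORDS:
        return "positive"
    if words & NEGATIVE_WORDS:
        return "negative"
    return "neutral"


def micro_batches(records, batch_size):
    """Yield successive fixed-size chunks, mimicking a stream of micro-batches."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_stream(records, batch_size=2):
    """Score each chunk as it arrives and keep running sentiment counts."""
    running = Counter()
    snapshots = []
    for batch in micro_batches(records, batch_size):
        running.update(toy_sentiment(text) for text in batch)
        snapshots.append(dict(running))  # state after this micro-batch
    return snapshots


reviews = ["great keyboard", "broken on arrival", "it is a mouse", "love it", "bad screen"]
snapshots = run_stream(reviews, batch_size=2)
```

Each snapshot is the aggregate state after one micro-batch, which is the kind of incrementally updated result a near real-time dashboard would read.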

  5. Clustering Analysis (Notebook 5_ClusteringAnalysis):

    • Perform clustering (e.g., K-Means) on product reviews based on selected features (e.g., TF-IDF of review text, sentiment scores, product metadata).
    • Identify and profile distinct customer/product segments.
    • Note: the clustering appears to be based on product features and review sentiment, so it groups products or surfaces review patterns rather than segmenting customers directly from review text alone. PCA was also explored.

    Technologies: K-Means Clustering, PCA
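
For readers unfamiliar with K-Means, the sketch below shows the assign-then-update loop (Lloyd's algorithm) in plain Python. It is a minimal illustration, not the project's MLlib pipeline: the two features (mean rating, share of positive reviews), the sample values, and the fixed initial centroids are all assumptions made for the example.

```python
import math


def kmeans(points, initial_centroids, n_iter=20):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignments, centroids


# Hypothetical product features: (mean rating, share of positive reviews)
products = [(4.8, 0.9), (4.6, 0.85), (4.7, 0.95), (2.1, 0.2), (1.9, 0.1), (2.3, 0.25)]
labels, centers = kmeans(products, initial_centroids=[products[0], products[3]])
```

In practice features on different scales (ratings from 1 to 5, shares from 0 to 1) are usually standardised first, since K-Means relies on Euclidean distance.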

  6. Graph Analysis (Notebook 6_GraphAnalysis): 🔗

    • Construct a graph of user-product interactions using GraphFrames.
    • Apply PageRank algorithm to identify influential products and users.
    • Use Label Propagation algorithm to detect communities within the review network.

    Technologies: GraphFrames, PageRank, Label Propagation
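
To show what PageRank computes on a user-product graph, here is a small power-iteration sketch in plain Python. It is not the GraphFrames implementation used in the project: the normalisation (ranks sum to 1), the damping factor of 0.85, the handling of nodes without outgoing edges, and the example edges are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, n_iter=50):
    """Power-iteration PageRank over directed (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical review edges: user -> product they reviewed
review_edges = [
    ("u1", "p_laptop"), ("u2", "p_laptop"), ("u3", "p_laptop"),
    ("u1", "p_mouse"), ("u2", "p_mouse"),
]
ranks = pagerank(review_edges)
top_node = max(ranks, key=ranks.get)
```

With edges pointing from users to products, rank accumulates on heavily reviewed products. Ranking users as well would need edges in both directions.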

  7. Results Analysis & Visualization (Notebook 4_ResultsAnalysis and throughout): 📊📈

    • Analyze results from sentiment, topic, clustering, and graph analyses.
    • Visualize findings using dashboards, charts, and tables (e.g., sentiment distribution, topic trends, cluster profiles, PageRank distributions, community structures).
    • Synthesize insights to address the project's objectives.

    Technologies: Databricks, Matplotlib, Seaborn

📈 Key Results

  • Successfully processed and analyzed a large subset of the Amazon Reviews dataset.
  • Applied advanced NLP models for sentiment and topic classification in a distributed environment.
  • Identified influential products and user communities through graph analysis.
  • The clustering solution achieved an R² of 0.80, indicating a strong separation and clear definition between the identified product/review clusters.
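
For context on the R² figure, a common convention for clustering is R² = 1 - SSW / SST, the share of total variance explained by the cluster assignment. The README does not state which formula the project used, so the sketch below is an assumption, and `clustering_r2` and the toy data are hypothetical.

```python
def clustering_r2(points, labels):
    """R² = 1 - (within-cluster sum of squares) / (total sum of squares)."""
    n, dim = len(points), len(points[0])
    overall_mean = [sum(p[d] for p in points) / n for d in range(dim)]
    sst = sum((p[d] - overall_mean[d]) ** 2 for p in points for d in range(dim))

    ssw = 0.0
    for cluster in set(labels):
        members = [p for p, label in zip(points, labels) if label == cluster]
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        ssw += sum((p[d] - center[d]) ** 2 for p in members for d in range(dim))
    return 1 - ssw / sst


# Toy data: two tight, well-separated groups
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
r2_two_clusters = clustering_r2(toy_points, [0, 0, 1, 1])
r2_one_cluster = clustering_r2(toy_points, [0, 0, 0, 0])
```

Under this definition a value near 1 means points sit close to their cluster centres relative to the overall spread, while a single cluster containing everything gives 0.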

📚 Conclusion & Future Work

This project demonstrates a robust, end-to-end Big Data analytics pipeline. The combination of Spark, advanced ML models, and graph analytics provided deep insights into consumer behavior and product competitiveness on Amazon.

Future work could include:

  • Building predictive models for sales forecasting based on review trends.
  • Developing a product recommendation system using the clustering and graph analysis results.
  • Expanding the analysis to other product categories or timeframes.

Dive into our notebooks to see the data magic unfold! 🪄 Fair warning: your Amazon cart might just start recommending itself... 🛒💸

📂 Notebooks Structure

The project is organized into several Databricks notebooks, each focusing on a specific stage of the pipeline. Pre-executed HTML versions are available in the repository for easy viewing.

  1. Data Collection & Initial Setup
  2. Business and Data Understanding & Exploratory Data Analysis (EDA)
  3. Text Analysis (WordCloud, Number of Important Words)
  4. Modelling (Sentiment & Topic) & Streaming Simulation
  5. Results Analysis and Visualization
  6. Clustering Analysis
  7. Graph Analysis

Note: All notebooks were developed and executed in the Databricks environment, except 3_Modelling_BDAProject_Group37_Colab.ipynb. The pre-executed HTML versions include each notebook's results, so the analysis and outputs can be viewed without a Databricks setup.


👥 Team (Group 37)

  • André Silvestre (20240502)
  • Filipa Pereira (20240509)
  • João Henriques (20240499)
  • Umeima Mahomed (20240543)
