This project aims to analyse consumer purchasing behaviour by exploring user reviews from Amazon's Electronics (Computers) category between 2022 and 2023.
The data comes from McAuley Lab's Amazon Reviews 2023 collection, which includes over 571M reviews. Using the Databricks environment with Apache Spark as the Big Data processing engine, together with Transformer-based models, we performed sentiment and topic analysis, clustering of product reviews, and graph analysis to understand how products compete within this e-commerce sector and how consumers behave.
Throughout the project, we relied on PySpark as our core API to interact with Spark components such as Spark SQL, MLlib, Streaming, and GraphFrames.
The primary objectives of this project are to:
- Extract and process data from Amazon Reviews '23, specifically targeting Computer Products in the Electronics category.
- Conduct sentiment analysis utilizing pre-trained Transformer-based models (TX-RoBERTa and mDeBERTa) to classify the sentiment expressed in user reviews.
- Identify key trends by uncovering trending topics and themes within computer product reviews to understand consumer preferences and concerns.
- Perform clustering analysis on product reviews to identify distinct segments.
- Conduct graph analysis (PageRank, Label Propagation) to understand user-product interactions, influential nodes, and communities.
- Simulate and process streaming data using Spark Streaming.
- Synthesize the findings into actionable insights for businesses and stakeholders.
This project was developed for the Big Data Analytics course in the Master's in Data Science and Advanced Analytics program at NOVA IMS, during the 2nd Semester of the 2024-2025 academic year.
The project followed the CRISP-DM methodology, executed within a distributed computing environment.
Figure 1: Project Flowchart.
Business & Data Understanding (Notebook 0_DataCollection&1_BU&EDA):
- Problem Definition: Analyze consumer purchasing behaviour and product competitiveness in the Amazon Electronics (Computers) sector using reviews from 2022-2023.
- Data Source: Amazon Reviews 2023 dataset (Electronics category).
- Initial Exploration: Understand dataset attributes, size, and structure. Identify key data fields for reviews and product metadata.
Data Collection & Preprocessing (Notebook 0_DataCollection&1_BU&EDA):
- Obtain and load the large dataset into the Databricks environment.
- Clean and preprocess the data using PySpark: handle missing values, convert data types (e.g., timestamp), filter relevant data (reviews from 2022-2023, 'Computers' category).
- Feature engineering: e.g., creating a combined text field from review title and body.
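The preprocessing steps above can be sketched in plain Python (rather than PySpark, so it runs without a cluster). This is a minimal illustration, not the notebook's code: the sample rows are invented, and the field names (`title`, `text`, `timestamp`) and the millisecond Unix timestamps are assumptions about the review schema.

```python
from datetime import datetime, timezone

# Hypothetical sample rows; field names and millisecond timestamps are assumptions.
reviews = [
    {"title": "Great laptop", "text": "Fast and quiet.", "timestamp": 1656633600000},
    {"title": None, "text": "Stopped working after a week.", "timestamp": 1672531200000},
    {"title": "Old review", "text": "Bought years ago.", "timestamp": 1577836800000},
    {"title": "No body", "text": None, "timestamp": 1680307200000},
]

START = datetime(2022, 1, 1, tzinfo=timezone.utc)
END = datetime(2024, 1, 1, tzinfo=timezone.utc)  # exclusive upper bound


def preprocess(rows):
    """Convert timestamps, keep 2022-2023 reviews, and build a combined text field."""
    cleaned = []
    for row in rows:
        if row.get("timestamp") is None:
            continue  # drop rows without a usable timestamp
        review_time = datetime.fromtimestamp(row["timestamp"] / 1000, tz=timezone.utc)
        if not (START <= review_time < END):
            continue  # outside the 2022-2023 window
        # Treat missing title/body as empty strings before combining them.
        title = (row.get("title") or "").strip()
        body = (row.get("text") or "").strip()
        full_text = ". ".join(part for part in (title, body) if part)
        if not full_text:
            continue  # nothing left to analyse
        cleaned.append({**row, "review_time": review_time, "full_text": full_text})
    return cleaned


cleaned_reviews = preprocess(reviews)
for row in cleaned_reviews:
    print(row["review_time"].date(), "|", row["full_text"])
```

The same three steps (type conversion, filtering, text concatenation) map naturally onto DataFrame column operations in Spark; the category filter is omitted here for brevity.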
Sentiment & Topic Analysis (Notebook 2_TextAnalysis):
- Apply pre-trained Transformer models (TX-RoBERTa and mDeBERTa) for sentiment classification (Positive, Neutral, Negative) on review text.
- Combine outputs from both models to create a robust sentiment score.
- Perform topic analysis (e.g., using mDeBERTa zero-shot classification) on product titles to identify product categories.
- Analyze sentiment distribution across different topics/products.
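One common way to combine two classifiers is soft voting, i.e. averaging their class probabilities. The sketch below assumes that approach; the README does not state how the two model outputs were actually merged, so `combine_sentiment`, its `weight_a` parameter, and the probability values are all hypothetical.

```python
LABELS = ("negative", "neutral", "positive")


def combine_sentiment(probs_a, probs_b, weight_a=0.5):
    """Soft voting: weighted average of two models' class probabilities.

    Returns the winning label, a signed score in [-1, 1], and the averaged
    probabilities. The weighting scheme is an assumption for illustration.
    """
    weight_b = 1.0 - weight_a
    combined = {
        label: weight_a * probs_a[label] + weight_b * probs_b[label]
        for label in LABELS
    }
    label = max(LABELS, key=lambda name: combined[name])
    score = combined["positive"] - combined["negative"]
    return label, score, combined


# Hypothetical per-review outputs of the two models.
roberta_probs = {"negative": 0.1, "neutral": 0.2, "positive": 0.7}
deberta_probs = {"negative": 0.2, "neutral": 0.5, "positive": 0.3}

label, score, combined = combine_sentiment(roberta_probs, deberta_probs)
print(label, round(score, 2))
```

Averaging probabilities rather than taking a majority vote keeps information about how confident each model is, which matters when the two models disagree.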
Streaming Simulation (Notebook 3.1_ModellingStreamingSimulation):
- Simulate a streaming scenario by processing review data in chunks.
- Apply sentiment analysis models to incoming data streams using Spark Streaming.
- Demonstrate capabilities for near real-time processing and analysis.
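The chunked processing idea can be illustrated without Spark. In this sketch, `micro_batches`, `toy_sentiment`, and `run_stream` are hypothetical helpers: reviews arrive in small batches, each batch is scored, and a running sentiment count is updated. The keyword scorer is only a stand-in for the Transformer models.

```python
from collections import Counter
from itertools import islice


def micro_batches(records, batch_size):
    """Yield consecutive chunks of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def toy_sentiment(text):
    """Hypothetical stand-in for the sentiment models: simple keyword lookup."""
    lowered = text.lower()
    if any(word in lowered for word in ("great", "love", "excellent")):
        return "positive"
    if any(word in lowered for word in ("broken", "bad", "refund")):
        return "negative"
    return "neutral"


def run_stream(records, batch_size=2):
    """Score each micro-batch and record the running totals after every batch."""
    running = Counter()
    snapshots = []
    for batch in micro_batches(records, batch_size):
        running.update(toy_sentiment(text) for text in batch)
        snapshots.append(dict(running))
    return snapshots


incoming = [
    "Great keyboard",
    "Arrived broken",
    "It is a mouse",
    "Love this monitor",
    "Bad cable, want a refund",
]
snapshots = run_stream(incoming, batch_size=2)
for i, snapshot in enumerate(snapshots, start=1):
    print(f"after batch {i}: {snapshot}")
```

Keeping state across batches (here, the running counter) is what distinguishes this from simply running the batch job several times.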
Clustering Analysis (Notebook 5_ClusteringAnalysis):
- Perform clustering (e.g., K-Means) on product reviews based on selected features (e.g., TF-IDF of review text, sentiment scores, product metadata).
- Identify and profile distinct customer/product segments.
- Note: the clustering appears to rely on product features and review sentiment, grouping products and surfacing review patterns rather than segmenting customers directly from review text alone. PCA was also explored.
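For readers unfamiliar with K-Means, here is a minimal pure-Python version of Lloyd's algorithm. It is a sketch under stated assumptions: the two features per product (average rating, mean sentiment score) and the data points are invented, and the initial centroids are fixed by hand so the run is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = mean_point(members)
    return assignments, centroids


# Hypothetical product features: (average rating, mean sentiment score).
products = [(4.6, 0.8), (4.4, 0.7), (4.5, 0.9), (2.1, -0.6), (1.9, -0.7), (2.3, -0.5)]
assignments, centroids = kmeans(products, initial_centroids=[products[0], products[3]])
print(assignments)
print(centroids)
```

In practice features on different scales (ratings vs. sentiment scores) are usually standardized first, since K-Means is distance-based.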
Graph Analysis (Notebook 6_GraphAnalysis):
- Construct a graph of user-product interactions using GraphFrames.
- Apply PageRank algorithm to identify influential products and users.
- Use Label Propagation algorithm to detect communities within the review network.
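Both graph algorithms can be illustrated on a toy review graph. The sketch below uses textbook versions in plain Python: PageRank by power iteration (ranks sum to 1, dangling mass spread uniformly) and an asynchronous label propagation with ties broken by the smallest label. GraphFrames' own implementations may differ in normalization and update order, and the edge list is invented.

```python
from collections import Counter


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their rank over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def label_propagation(edges, max_rounds=20):
    """Asynchronous label propagation on the undirected version of the graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    labels = {node: node for node in neighbours}  # every node starts alone
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbours):
            counts = Counter(labels[nb] for nb in neighbours[node])
            best = max(counts.values())
            # Deterministic tie-break: smallest label among the most frequent.
            new_label = min(lbl for lbl, cnt in counts.items() if cnt == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical (user, product) review edges.
edges = [("u1", "pA"), ("u2", "pA"), ("u3", "pA"), ("u6", "pA"),
         ("u3", "pB"), ("u4", "pB"), ("u5", "pB")]

ranks = pagerank(edges)
labels = label_propagation(edges)
communities = {}
for node, label in labels.items():
    communities.setdefault(label, set()).add(node)

print(sorted(ranks, key=ranks.get, reverse=True)[:2])
print([sorted(members) for members in communities.values()])
```

On this toy graph the more-reviewed product ranks highest, and the reviewers split into two communities around the two products even though one user reviewed both.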
Results Analysis & Visualization (Notebook 4_ResultsAnalysis and throughout):
- Analyze results from sentiment, topic, clustering, and graph analyses.
- Visualize findings using dashboards, charts, and tables (e.g., sentiment distribution, topic trends, cluster profiles, PageRank distributions, community structures).
- Synthesize insights to address the project's objectives.
Key results:
- Successfully processed and analyzed a large subset of the Amazon Reviews dataset.
- Applied advanced NLP models for sentiment and topic classification in a distributed environment.
- Identified influential products and user communities through graph analysis.
- The clustering solution achieved an R² of 0.80, indicating a strong separation and clear definition between the identified product/review clusters.
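For clustering, R² is commonly computed as the share of total variance explained by the cluster assignment, R² = 1 - SSW / SST, where SSW is the within-cluster sum of squares and SST the total sum of squares. The sketch below assumes that definition (the README does not spell out the formula used), and the points are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))


def clustering_r_squared(points, labels):
    """R^2 = 1 - SSW / SST for a given cluster assignment (assumed definition)."""
    overall_mean = mean_point(points)
    sst = sum(squared_distance(p, overall_mean) for p in points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    ssw = 0.0
    for members in clusters.values():
        centroid = mean_point(members)
        ssw += sum(squared_distance(p, centroid) for p in members)
    return 1.0 - ssw / sst


# Hypothetical 2-D points forming two tight, well-separated groups.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
r2_two_clusters = clustering_r_squared(points, [0, 0, 1, 1])
r2_one_cluster = clustering_r_squared(points, [0, 0, 0, 0])
print(round(r2_two_clusters, 3), round(r2_one_cluster, 3))
```

Under this definition a value near 1 means points sit close to their own cluster centroid relative to the overall spread, while a single cluster always gives 0.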
This project demonstrates a robust, end-to-end Big Data analytics pipeline. The combination of Spark, advanced ML models, and graph analytics provided deep insights into consumer behavior and product competitiveness on Amazon.
Future work could include:
- Building predictive models for sales forecasting based on review trends.
- Developing a product recommendation system using the clustering and graph analysis results.
- Expanding the analysis to other product categories or timeframes.
Dive into our notebooks to see the data magic unfold! Fair warning: your Amazon cart might just start recommending itself...
The project is organized into several Databricks notebooks, each focusing on a specific stage of the pipeline. Pre-executed HTML versions are available in the repository for easy viewing.
- Data Collection & Initial Setup
- Business and Data Understanding & Exploratory Data Analysis (EDA)
- Text Analysis (WordCloud, Number of Important Words)
- Modelling (Sentiment & Topic) & Streaming Simulation
- Results Analysis and Visualization
- Clustering Analysis
- Graph Analysis
Note: All notebooks were developed and executed in the Databricks environment (except 3_Modelling_BDAProject_Group37_Colab.ipynb). If you prefer not to run them interactively, the pre-executed HTML versions in the repository include all results, so the analysis and outputs can be viewed without a Databricks setup.
- André Silvestre (20240502)
- Filipa Pereira (20240509)
- João Henriques (20240499)
- Umeima Mahomed (20240543)
