Predicting second‑hand car prices with classic tabular ML.
Data: 402 006 rows · 12 columns (target = `price`).
Models: Linear Regression · Random Forest · Gradient Boosting · Voting Ensemble.
- Project motivation
- Data
- Quick start
- Notebook & code guide
- Results at a glance
- Model interpretation
- Directory layout
 
Buying a used car is a price‑sensitive decision.
The goal is to build transparent, reproducible baselines that predict price
given mileage, age, fuel type and a handful of categorical descriptors.
Grades in the coursework are not the focus; clean code and solid discussion are.
- Source: AutoTrader extract supplied by Manchester Metropolitan University. The licence prohibits redistribution, so the CSV is not committed to this repository.
- Rows: 402 006. Columns: 12 (all except `price` used as predictors).
- Cleaning steps
  - Trim outliers in `mileage` & `price` via 1.5 × IQR.
  - Drop cars registered before 1975.
  - Mode‑impute gaps in `fuel_type`, `body_type`, `standard_colour`.
- Engineered features
  - `vehicle_age` = 2024 – `year_of_registration`
  - `mileage_to_age_ratio` = `mileage` / `vehicle_age`
 
See `notebooks/01_autotrader_walkthrough.ipynb` for the exact code.
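
The cleaning and feature steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's code: the function names `trim_iqr` and `clean_and_engineer` are hypothetical, and the handling of missing values and of cars with `vehicle_age` = 0 (where the ratio is undefined) is an assumption, since the steps above do not specify it.

```python
import pandas as pd

REFERENCE_YEAR = 2024  # taken from the vehicle_age definition above


def trim_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR].

    Side effect of this sketch: rows with a missing value in `column`
    are dropped too, because NaN never satisfies `between`.
    """
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


def clean_and_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the listed steps, applied in order."""
    # 1. Trim outliers in mileage and price via 1.5 x IQR.
    for col in ("mileage", "price"):
        df = trim_iqr(df, col)

    # 2. Drop cars registered before 1975 (rows with a missing year also go).
    df = df[df["year_of_registration"] >= 1975].copy()

    # 3. Mode-impute gaps in the categorical descriptors.
    for col in ("fuel_type", "body_type", "standard_colour"):
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 4. Engineered features.
    df["vehicle_age"] = REFERENCE_YEAR - df["year_of_registration"]
    # Assumption: an age of 0 would divide by zero, so the ratio is left as NaN.
    positive_age = df["vehicle_age"].where(df["vehicle_age"] > 0)
    df["mileage_to_age_ratio"] = df["mileage"] / positive_age
    return df
```

Calling `clean_and_engineer(pd.read_csv("data/Adverts.csv"))` would return a cleaned copy with the two extra columns, provided the CSV uses the column names shown above.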
# clone repo
git clone https://github.com/hamzahassan9320/autotrader-price-regression.git
cd autotrader-price-regression
# place the CSV in the expected location
mkdir -p data
cp /path/to/Adverts.csv data/
# set up environment
conda create -n autotrader-price python=3.10
conda activate autotrader-price
pip install -r requirements.txt
# full pipeline
python -m src.train --csv data/Adverts.csv
# run the Streamlit app locally
streamlit run app.py

Tested with Python 3.10 and scikit‑learn 1.3.2.
| file | purpose |
|---|---|
| `notebooks/01_autotrader_walkthrough.ipynb` | data snapshot, EDA, demos |
| `src/data.py` | load + cleanse CSV |
| `src/features.py` | feature engineering & preprocessing |
| `src/models.py` | pipelines · param grids · grid‑search helper |
| `src/train.py` | one‑shot CLI training run; saves models & plots |
| `src/visualise.py` | regenerates figures in `docs/images/` |
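
For orientation, a preprocessing pipeline with a grid‑search helper of the kind the table attributes to `src/features.py` and `src/models.py` might look like the sketch below. It is not the repository's code: `make_rf_pipeline`, `grid_search`, the column lists, the median imputer and the scaler are all assumptions chosen for illustration; only the scikit‑learn classes themselves are real.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column split; the real lists live in the repository.
NUMERIC = ["mileage", "vehicle_age", "mileage_to_age_ratio"]
CATEGORICAL = ["fuel_type", "body_type", "standard_colour"]


def make_rf_pipeline(seed: int = 0) -> Pipeline:
    """Hypothetical pipeline: preprocess columns, then fit a Random Forest."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # assumption: fills any NaN
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestRegressor(random_state=seed)),
    ])


def grid_search(pipeline: Pipeline, param_grid: dict, X, y, cv: int = 5) -> GridSearchCV:
    """Hypothetical helper: exhaustive search scored by (negated) MAE."""
    search = GridSearchCV(
        pipeline,
        param_grid,
        scoring="neg_mean_absolute_error",
        cv=cv,
    )
    return search.fit(X, y)
```

Parameter names in the grid are prefixed with the pipeline step, for example `{"model__n_estimators": [100, 300]}`; the fitted object exposes `best_params_` and `best_score_`.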
| model | CV MAE ↓ | Test R² | 
|---|---|---|
| Linear Regression | 1 642 ± 394 | 0.79 | 
| Random Forest | 1 831 ± 51 | 0.90 | 
| Gradient Boosting | 2 742 ± 95 | 0.87 | 
| Voting Ensemble | 1 894 ± 44 | 0.89 | 
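Random Forest reaches the highest test R² (0.90) with a stable CV MAE (1 831 ± 51) and no visible over‑fit. Linear Regression has the lowest mean CV MAE (1 642) but by far the largest spread (± 394) and the lowest test R² (0.79).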
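
Figures of the form "CV MAE mean ± std" and "test R²" can be produced with scikit‑learn along the lines of the sketch below. It is illustrative only: `evaluate` and `candidate_models` are hypothetical names, and the 80/20 split, the fold count, the hyperparameters and the composition of the voting ensemble (the three base models) are assumptions rather than the settings behind the table.

```python
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split


def candidate_models(seed: int = 0) -> dict:
    """Hypothetical model zoo matching the four rows of the table."""
    lr = LinearRegression()
    rf = RandomForestRegressor(n_estimators=50, random_state=seed)
    gb = GradientBoostingRegressor(random_state=seed)
    # Assumption: the ensemble averages the three base models' predictions.
    vote = VotingRegressor([("lr", lr), ("rf", rf), ("gb", gb)])
    return {
        "Linear Regression": lr,
        "Random Forest": rf,
        "Gradient Boosting": gb,
        "Voting Ensemble": vote,
    }


def evaluate(model, X, y, cv: int = 5, seed: int = 0) -> dict:
    """Cross-validate MAE on the training split, then score R² on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # scikit-learn maximises scores, so MAE comes back negated.
    mae = -cross_val_score(
        model, X_tr, y_tr, scoring="neg_mean_absolute_error", cv=cv
    )
    model.fit(X_tr, y_tr)
    return {
        "cv_mae_mean": mae.mean(),
        "cv_mae_std": mae.std(),
        "test_r2": r2_score(y_te, model.predict(X_te)),
    }
```

Looping `evaluate` over `candidate_models().items()` yields one row per model; comparing both the mean and the standard deviation of the MAE shows whether a low average hides unstable folds.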
- SHAP beeswarm → global drivers (top features: `vehicle_age`, `mileage`).
- SHAP waterfall → why a single advert (row 39) is priced ± £9 k.
- Partial dependence → price drops near‑linearly with age; flattening after ~15 yrs hints at a market floor.
 
All figures live in docs/images/, regenerated by src/visualise.py.
.
├── data/                # <empty> – you add Adverts.csv locally
├── notebooks/           # single exploratory notebook
├── src/                 # reusable code
├── configs/             # YAML config(s)
├── docs/images/         # plots for README
└── requirements.txt
