This repository contains a fraud detection pipeline for financial transactions, leveraging data preprocessing, feature engineering, class imbalance handling (SMOTE), and a diverse set of machine learning models (Logistic Regression, Random Forest, LightGBM, CatBoost, XGBoost, and ensemble methods).
Highlights:
- Novel feature engineering (time-based features, transaction amount bucketing, etc.)
- Handling imbalanced data via SMOTE
- Boosting algorithms (LightGBM, XGBoost, CatBoost) for high-dimensional data
- Advanced neural network approach with a supervised AutoEncoder for anomaly detection
- Stacking and voting ensembles for robust, high AUC-ROC performance
Our best model (LightGBM) achieved an AUC-ROC of 0.89 on the Vesta Corporation dataset.
We use the Vesta Corporation dataset from the Kaggle IEEE-CIS Fraud Detection competition (https://www.kaggle.com/competitions/ieee-fraud-detection/overview), which includes:
- Transaction data (TransactionID, card info, transaction amount, time, etc.)
- Identity data (Device info, etc.)
Due to size and privacy concerns, the real dataset is not included in this repo.
Key columns:
- TransactionID
- isFraud (target)
- TransactionDT, TransactionAmt
- Categorical features (ProductCD, card1, card2, etc.)
- Identity features (DeviceType, DeviceInfo)
- Data Preprocessing
  - Missing value imputation
  - High-correlation feature removal via a correlation heatmap (see the sketch below)
  - Encoding categorical features (one-hot or label encoding)
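As an illustration of the correlation-pruning step, here is a minimal sketch that drops one column from every highly correlated pair. It assumes a pandas DataFrame `df` of numeric features, and the 0.95 threshold is an illustrative choice rather than the value used in the notebook:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```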
 
- Feature Engineering
  - Transaction amount bucketing (micro, small, etc.)
  - Time-based features (day-of-week, hour-of-day)
  - Email domain grouping (e.g., major providers vs. niche); see the sketch below
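The sketch below shows one plausible version of these features, assuming the raw Kaggle columns (`TransactionDT` is a seconds offset from a fixed reference time; `P_emaildomain` holds the purchaser's email domain). The bucket edges and the list of major providers are illustrative:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # TransactionDT is a time delta in seconds, so integer division recovers
    # hour-of-day and day-of-week cycles
    out["hour_of_day"] = (out["TransactionDT"] // 3600) % 24
    out["day_of_week"] = (out["TransactionDT"] // (3600 * 24)) % 7
    # Bucket the transaction amount; cut points here are illustrative
    out["amt_bucket"] = pd.cut(
        out["TransactionAmt"],
        bins=[0, 10, 50, 200, 1000, float("inf")],
        labels=["micro", "small", "medium", "large", "huge"],
    )
    # Group email domains: major providers keep their name, everything else
    # (including missing values) collapses into "other"
    major = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com", "aol.com"}
    out["email_group"] = out["P_emaildomain"].where(out["P_emaildomain"].isin(major), "other")
    return out
```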
 
- Handling Class Imbalance
  - SMOTE (Synthetic Minority Oversampling Technique) to oversample the minority (fraud) class; see the sketch below
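A minimal sketch of this step with `imbalanced-learn`, using synthetic stand-in data; the key detail is that SMOTE is fit on the training split only, so synthetic points never leak into evaluation:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in data with ~3% positives, mimicking the rarity of fraud
X, y = make_classification(n_samples=10_000, n_features=20, weights=[0.97], random_state=42)

# Split first, oversample second: only the training portion is resampled
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(f"fraud rate before: {y_train.mean():.3f}, after: {y_res.mean():.3f}")
```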
 
- Model Training
  - Logistic Regression and Random Forest as baselines
  - LightGBM, CatBoost, and XGBoost as boosting methods
  - Hyperparameter tuning via Bayesian Optimization
  - AUC-ROC as the primary metric (see the training sketch below)
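Continuing from the SMOTE sketch above (`X_res`, `y_res`, `X_val`, `y_val`), here is a minimal LightGBM run with AUC-based early stopping; the hyperparameters are placeholders, not the values found by the Bayesian search:

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    num_leaves=64,  # placeholder values; the notebook tunes these
    random_state=42,
)
model.fit(
    X_res, y_res,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
print(f"Validation AUC-ROC: {roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]):.3f}")
```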
 
- Ensemble Methods
  - Voting (soft voting across LightGBM, CatBoost, XGBoost, etc.)
  - Stacking with a meta-learner (see the sketch below)
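Both ensembles can be expressed directly in scikit-learn, as in the sketch below; the base models use default settings here rather than the tuned configurations:

```python
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

base = [
    ("lgbm", lgb.LGBMClassifier(random_state=42)),
    ("xgb", xgb.XGBClassifier(eval_metric="auc", random_state=42)),
    ("cat", CatBoostClassifier(verbose=0, random_state=42)),
]
# Soft voting averages the base models' predicted probabilities
voter = VotingClassifier(estimators=base, voting="soft")
# Stacking trains a logistic-regression meta-learner on out-of-fold predictions
stacker = StackingClassifier(estimators=base, final_estimator=LogisticRegression(max_iter=1000))
# Both fit like any scikit-learn estimator, e.g. voter.fit(X_res, y_res)
```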
 
- AutoEncoder (Optional Neural Approach)
  - A supervised autoencoder that outputs a fraud probability (or uses reconstruction error); see the sketch below
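One way to realize this is a shared encoder feeding both a reconstruction head and a sigmoid fraud head, as in the Keras sketch below; the layer sizes and loss weights are illustrative assumptions, not the architecture used in the notebook:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_supervised_autoencoder(n_features: int) -> Model:
    inputs = tf.keras.Input(shape=(n_features,))
    # Shared encoder compresses each transaction into a small latent code
    h = layers.Dense(128, activation="relu")(inputs)
    latent = layers.Dense(32, activation="relu")(h)
    # Decoder head reconstructs the input (the reconstruction-error signal)
    recon = layers.Dense(128, activation="relu")(latent)
    recon = layers.Dense(n_features, name="reconstruction")(recon)
    # Supervised head predicts fraud probability from the latent code
    fraud = layers.Dense(1, activation="sigmoid", name="fraud")(latent)
    model = Model(inputs, [recon, fraud])
    model.compile(
        optimizer="adam",
        loss={"reconstruction": "mse", "fraud": "binary_crossentropy"},
        loss_weights={"reconstruction": 0.5, "fraud": 1.0},  # illustrative weighting
    )
    return model
```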
 
| Model               | AUC-ROC  |
|---------------------|----------|
| Logistic Regression | 0.80     |
| Random Forest       | 0.855    |
| LightGBM            | **0.89** |
| CatBoost            | 0.881    |
| XGBoost             | 0.874    |
| Voting Ensemble     | 0.86     |
| Stacking            | 0.88     |
| AutoEncoder         | 0.86     |
LightGBM emerges as the top performer with 0.89 AUC-ROC, balancing speed and accuracy on this high-dimensional dataset.
- Clone the repo:

  ```bash
  git clone https://github.com/YourUser/transaction-fraud-detection.git
  cd transaction-fraud-detection
  ```

- Set up the environment (create a `requirements.txt` if you like):

  ```bash
  conda create -n fraud python=3.8
  conda activate fraud
  pip install -r requirements.txt
  ```

- Run the notebook, adjusting paths as needed to point to your dataset:

  ```bash
  jupyter notebook notebooks/main.ipynb
  ```
- Explore other techniques for class imbalance (e.g., ADASYN, cost-sensitive learning).
- Investigate deeper neural network architectures or specialized anomaly detection methods.
- Implement real-time streaming pipelines (Spark Streaming, Kafka) for transaction-level fraud detection.
- Dataset by Vesta Corporation [https://www.kaggle.com/competitions/ieee-fraud-detection/overview].
- Project under Dr. Yanjie Fu, Arizona State University.
This project is released under the MIT License. That means you’re free to use, modify, and distribute the code, but you do so at your own risk.
Author: Varshith Dupati 
GitHub: @dvarshith 
Email: dvarshith942@gmail.com 
Issues: Please open an issue on this repo if you have questions or find bugs.