Get a Kaggle API key from the Kaggle website, then:

```bash
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

Running the automated setup (`./setup_ieee.sh`) will:
- Create Python 3.9 virtual environment
- Activate the environment
- Install all dependencies from `requirements.txt`
```bash
# Create virtual environment
python3.9 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Verify the environment:

```bash
python --version  # Should show Python 3.9.x
pip list          # Should show all installed packages
```

Run the full pipeline:

```bash
# 1. Start services
docker-compose up -d && sleep 30

# 2. Download dataset (requires Kaggle API setup)
python download_dataset.py

# 3. Load data (choose sample or full)
python load_ieee_data.py

# 4. Apply Featureform definitions
featureform apply definitions_v2.py --host localhost:7878 --insecure

# 5. Wait for READY (check dashboard: http://localhost)
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# 6. Train model
python train_model.py

# 7. Run inference
python inference.py
```

**Sample dataset:**

| Step | Time | Notes |
|---|---|---|
| Docker startup | 30s | One-time |
| Dataset download | 2-3 min | One-time, ~500MB |
| Data loading | 30s | Into Postgres |
| Feature materialization | 30-60s | Featureform processing |
| Model training | 30-60s | XGBoost training |
| Inference (20 predictions) | <1s | Real-time serving |
Total first run: ~6-8 minutes
Subsequent runs (data cached): ~3-4 minutes
**Full dataset:**

| Step | Time | Notes |
|---|---|---|
| Docker startup | 30s | One-time |
| Dataset download | 2-3 min | One-time, ~500MB |
| Data loading | 3-5 min | Into Postgres |
| Feature materialization | 2-5 min | Featureform processing |
| Model training | 3-5 min | XGBoost training |
| Inference (20 predictions) | <1s | Real-time serving |
Total first run: ~15-20 minutes
Subsequent runs (data cached): ~10-15 minutes
- Purpose: Download IEEE-CIS dataset from Kaggle
- Requirements: Kaggle API credentials at `~/.kaggle/kaggle.json`
- Output: CSVs in the `data/` directory (~500MB)
- Runtime: 2-3 minutes
- Purpose: Load CSV data into Postgres tables
- Interactive: Asks for sample (50K rows) or full (590K rows)
- Processing:
  - Creates `ieee_transaction` table
  - Creates `ieee_identity` table
  - Creates indexes
  - Shows data statistics
- Runtime: 30 seconds (sample) or 3-5 minutes (full)
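The loading flow above can be sketched roughly as follows. This is an illustrative sketch, not the script's actual code: it uses `sqlite3` as a stand-in for Postgres so the example is self-contained, and the column set is abbreviated (the real IEEE-CIS tables have hundreds of columns).

```python
# Hypothetical sketch of the load_ieee_data.py flow, using sqlite3
# as a stand-in for Postgres so it runs without any services.
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create the transaction table (column set abbreviated for illustration)
cur.execute("""
    CREATE TABLE ieee_transaction (
        transaction_id INTEGER PRIMARY KEY,
        is_fraud INTEGER,
        transaction_amt REAL,
        card1 INTEGER
    )
""")
# Index an aggregation key, mirroring the script's "creates indexes" step
cur.execute("CREATE INDEX idx_card1 ON ieee_transaction (card1)")

# Inline sample standing in for data/train_transaction.csv
sample_csv = io.StringIO(
    "TransactionID,isFraud,TransactionAmt,card1\n"
    "2987000,0,68.5,13926\n"
    "2987001,1,29.0,2755\n"
)
rows = [
    (int(r["TransactionID"]), int(r["isFraud"]),
     float(r["TransactionAmt"]), int(r["card1"]))
    for r in csv.DictReader(sample_csv)
]
cur.executemany("INSERT INTO ieee_transaction VALUES (?, ?, ?, ?)", rows)

# Show data statistics, as the script does after loading
total = cur.execute("SELECT COUNT(*) FROM ieee_transaction").fetchone()[0]
fraud = cur.execute(
    "SELECT COUNT(*) FROM ieee_transaction WHERE is_fraud = 1"
).fetchone()[0]
print(f"Loaded {total} rows, {fraud} fraud ({fraud / total:.1%})")
```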
- Purpose: Define feature engineering pipeline in Featureform
- Creates:
  - 2 source tables
  - 4 SQL transformations
  - 8 features
  - 1 training set
- Not run directly: Applied with `featureform apply`
- Purpose: Train XGBoost fraud detection model
- Process:
  - Loads training set from Featureform
  - Splits into train/validation (80/20)
  - Trains XGBoost with class balancing
  - Evaluates with fraud-specific metrics
  - Saves model to `models/`
- Output:
  - Trained model file
  - Performance metrics
  - Feature importance
- Runtime: 30 seconds (sample) or 3-5 minutes (full)
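The split and class-balancing steps can be illustrated with a toy sketch. The labels and numbers below are made up, and `scale_pos_weight` (negatives divided by positives) is XGBoost's standard knob for imbalanced classes; the script's exact parameters may differ.

```python
# Illustrative sketch of the 80/20 split and class-balancing setup;
# not the actual train_model.py code.
import random

random.seed(0)
# Toy labels mirroring the dataset's ~3.5% fraud rate
labels = [1 if random.random() < 0.035 else 0 for _ in range(10_000)]

# 80/20 train/validation split
split = int(len(labels) * 0.8)
train, valid = labels[:split], labels[split:]

# XGBoost's usual class-balancing heuristic: negatives / positives
neg, pos = train.count(0), train.count(1)
scale_pos_weight = neg / pos
print(f"train={len(train)} valid={len(valid)} "
      f"scale_pos_weight={scale_pos_weight:.1f}")
```

This weight would then be passed to the booster (e.g. `XGBClassifier(scale_pos_weight=...)`) so fraud examples are not drowned out by the legitimate majority.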
- Purpose: Demonstrate production inference patterns
- Demonstrates:
  - Batch inference (score multiple transactions)
  - Real-time inference (single transaction)
  - Latency measurements
  - Production readiness assessment
- Output:
  - Predictions
  - Latency statistics (mean, median, p95, p99)
  - Throughput estimates
- Runtime: <1 second for 20 predictions
```
ROC-AUC: 0.92-0.95
Average Precision: 0.70-0.80

At 50% threshold:
  Precision: 0.85-0.90
  Recall: 0.75-0.85

Class distribution:
  Legitimate: 96.5%
  Fraud: 3.5%
```
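The threshold metrics above follow the usual definitions; here is a tiny worked example with made-up confusion counts, just to show how the precision and recall figures are derived:

```python
# Toy confusion counts at a 50% decision threshold (illustrative only)
true_pos, false_pos, false_neg = 170, 25, 40

precision = true_pos / (true_pos + false_pos)  # of flagged, how many were fraud
recall = true_pos / (true_pos + false_neg)     # of fraud, how many were caught
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
```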
```
Real-time inference (end-to-end):
  Mean: 12-18 ms
  Median: 10-15 ms
  P95: 20-30 ms
  P99: 30-50 ms

Breakdown:
  Feature serving (Redis): 8-15 ms (70-80%)
  Model prediction: 1-3 ms (20-30%)

Throughput (single thread):
  ~60-100 predictions/second
```
```bash
# Via CLI
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# Status will show:
# PENDING → RUNNING → READY (or FAILED)
```

Or via the dashboard:

```bash
open http://localhost
```

Navigate to:
- Sources: See `ieee_transaction`, `ieee_identity`
- Transformations: See `card_aggregates`, `email_aggregates`, etc.
- Features: See all 8 features
- Training Sets: See fraud_detection
```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f featureform
docker-compose logs -f postgres
docker-compose logs -f redis
```

Kaggle API errors:

```bash
# Setup Kaggle API
# 1. Go to https://www.kaggle.com/account
# 2. Click "Create New API Token"
# 3. Download kaggle.json
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Connection errors:

```bash
# Check if services are running
docker-compose ps

# Restart if needed
docker-compose restart postgres

# Wait and try again
sleep 10
python verify.py
```

Transformations stuck or FAILED:

```bash
# Check Featureform logs
docker-compose logs featureform

# Common causes:
# - SQL syntax error in transformation
# - Column name mismatch
# - Dependency not ready

# Try reapplying
featureform apply definitions_v2.py --host localhost:7878 --insecure
```

Training set not found:

```bash
# Make sure definitions were applied
featureform apply definitions_v2.py --host localhost:7878 --insecure

# Wait for READY status
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# Check dashboard
open http://localhost
```

Model file missing:

```bash
# Make sure training completed successfully
ls -lh models/

# If missing, retrain
python train_model.py
```

Restart all services:

```bash
docker-compose restart
```

Full reset (wipes volumes, data, and models):

```bash
docker-compose down -v
rm -rf data/ models/
./setup_ieee.sh
```

- **Experiment with features**
  - Edit `definitions_v2.py`
  - Add time-based features
  - Create interaction features

- **Tune the model**
  - Edit parameters in `train_model.py`
  - Try different thresholds
  - Implement cross-validation

- **Production deployment**
  - Deploy to Kubernetes
  - Add monitoring
  - Implement A/B testing
  - Setup feature refresh schedules

- **Advanced features**
  - Add streaming features
  - Implement feature importance tracking
  - Add model explainability (SHAP)
  - Create feature monitoring dashboards
```
.
├── setup_ieee.sh        # Automated setup
├── download_dataset.py  # Download from Kaggle
├── load_ieee_data.py    # Load into Postgres
├── definitions_v2.py    # Featureform definitions
├── train_model.py       # Train XGBoost
├── inference.py         # Run inference
├── docker-compose.yml   # Infrastructure
├── README_IEEE.md       # Full documentation
└── QUICK_REFERENCE.md   # This file
```
- Start with sample data for faster iteration
- Use the dashboard to visualize dependencies
- Check logs when things don't work
- Monitor latencies to understand bottlenecks
- Version your features using variants
```
[Data Loading] Completed in 15.23 seconds
✓ Loaded 47,237 training examples

[Model Training] Completed in 45.67 seconds
✓ Model trained successfully
  Best iteration: 87
  Best score: 0.9456

ROC-AUC Score: 0.9234
Average Precision Score: 0.7891
```

```
⚠ C121424: FRAUD (prob=0.892) [12.3ms total, 9.1ms serving]
✓ C543210: LEGIT (prob=0.034) [11.8ms total, 8.9ms serving]

Latency Statistics:
  Mean: 12.45 ms
  Median: 11.50 ms
  P95: 15.23 ms
  P99: 18.91 ms

✓ Excellent - Suitable for real-time transaction processing
```
Need help? Check the full documentation in README_IEEE.md