redis-applied-ai/2_feature-form-ieee-cis-fd

IEEE-CIS Fraud Detection - Quick Reference

🚀 Complete Workflow

Set up Kaggle (macOS)

Create an API token on the Kaggle website and download the resulting kaggle.json, then move it into place:

mkdir -p ~/.kaggle/
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
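
A quick way to confirm the credentials file is in place before running the download is sketched below. It is a minimal check, not part of this repository: the function name `check_kaggle_credentials` is hypothetical, and it only assumes that kaggle.json is a JSON object with `username` and `key` fields and that the file should not be readable by group or others.

```python
import json
import stat
from pathlib import Path

# Hypothetical helper, not part of this repo.
DEFAULT_PATH = Path.home() / ".kaggle" / "kaggle.json"


def check_kaggle_credentials(path: Path = DEFAULT_PATH) -> list:
    """Return a list of problems with the credentials file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []

    # chmod 600 means no permission bits are set for group or others.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"{path} has mode {oct(mode)}; run chmod 600 on it")

    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        problems.append(f"{path} is not valid JSON")
        return problems

    for field in ("username", "key"):
        if field not in creds:
            problems.append(f"{path} is missing the '{field}' field")
    return problems
```

Calling `check_kaggle_credentials()` with no arguments inspects ~/.kaggle/kaggle.json; an empty list means the file looks usable.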

Setup Virtual Environment

The commands below:

  • Create a Python 3.9 virtual environment
  • Activate the environment
  • Install all dependencies from requirements.txt

# Create virtual environment
python3.9 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Verify Installation

python --version  # Should show Python 3.9.x
pip list          # Should show all installed packages

Step-by-Step

# 1. Start services
docker-compose up -d && sleep 30

# 2. Download dataset (requires Kaggle API setup)
python download_dataset.py

# 3. Load data (choose sample or full)
python load_ieee_data.py

# 4. Apply Featureform definitions
featureform apply definitions_v2.py --host localhost:7878 --insecure

# 5. Wait for READY (check dashboard: http://localhost)
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# 6. Train model
python train_model.py

# 7. Run inference
python inference.py
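
Step 1 uses a fixed `sleep 30` to give the containers time to start. As an alternative, the sketch below polls a TCP port until it accepts connections. It is an illustration rather than something shipped with this repository: `wait_for_port` is a hypothetical helper, and only the Featureform port (7878) comes from the commands above. An open port does not guarantee the service has finished initializing, so treat this as a lower bound on readiness.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while the port is closed.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 7878)` could replace the fixed sleep before running `featureform apply`.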

⏱️ Expected Timings

With Sample Dataset (50,000 rows)

| Step | Time | Notes |
|------|------|-------|
| Docker startup | 30s | One-time |
| Dataset download | 2-3 min | One-time, ~500MB |
| Data loading | 30s | Into Postgres |
| Feature materialization | 30-60s | Featureform processing |
| Model training | 30-60s | XGBoost training |
| Inference (20 predictions) | <1s | Real-time serving |

Total first run: ~6-8 minutes
Subsequent runs (data cached): ~3-4 minutes

With Full Dataset (590,000 rows)

| Step | Time | Notes |
|------|------|-------|
| Docker startup | 30s | One-time |
| Dataset download | 2-3 min | One-time, ~500MB |
| Data loading | 3-5 min | Into Postgres |
| Feature materialization | 2-5 min | Featureform processing |
| Model training | 3-5 min | XGBoost training |
| Inference (20 predictions) | <1s | Real-time serving |

Total first run: ~15-20 minutes
Subsequent runs (data cached): ~10-15 minutes

📊 What Each Script Does

download_dataset.py

  • Purpose: Download IEEE-CIS dataset from Kaggle
  • Requirements: Kaggle API credentials at ~/.kaggle/kaggle.json
  • Output: CSVs in data/ directory (~500MB)
  • Runtime: 2-3 minutes

load_ieee_data.py

  • Purpose: Load CSV data into Postgres tables
  • Interactive: Asks for sample (50K) or full (590K)
  • Processing:
    • Creates ieee_transaction table
    • Creates ieee_identity table
    • Creates indexes
    • Shows data statistics
  • Runtime: 30 seconds (sample) or 3-5 minutes (full)
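
The sample-versus-full choice can be pictured as a row limit applied while reading the CSV. The sketch below is an assumption about the general idea, not the script's actual code: `read_rows` is a hypothetical helper, and it takes the first N rows, whereas load_ieee_data.py may sample differently (for example at random).

```python
import csv
from itertools import islice
from pathlib import Path
from typing import Iterator, Optional


def read_rows(csv_path: Path, limit: Optional[int] = None) -> Iterator[dict]:
    """Yield CSV rows as dicts, stopping after `limit` rows if one is given."""
    with csv_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        # islice with a limit of None passes every row through.
        yield from islice(reader, limit)
```

With this shape, `read_rows(path, 50_000)` would correspond to the sample option and `read_rows(path)` to the full load.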

definitions_v2.py

  • Purpose: Define feature engineering pipeline in Featureform
  • Creates:
    • 2 source tables
    • 4 SQL transformations
    • 8 features
    • 1 training set
  • Not run directly: Applied with featureform apply

train_model.py

  • Purpose: Train XGBoost fraud detection model
  • Process:
    1. Loads training set from Featureform
    2. Splits into train/validation (80/20)
    3. Trains XGBoost with class balancing
    4. Evaluates with fraud-specific metrics
    5. Saves model to models/
  • Output:
    • Trained model file
    • Performance metrics
    • Feature importance
  • Runtime: 30 seconds (sample) or 3-5 minutes (full)
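
Steps 2 and 3 of the process can be illustrated with the standard library alone. This is a sketch under assumptions: the source does not say how train_model.py splits or balances, so the stratified split and the negative-to-positive ratio (the value commonly passed to XGBoost's `scale_pos_weight` parameter) are one common approach, and both function names are hypothetical.

```python
import random


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx), keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


def positive_class_weight(labels):
    """Ratio of negatives to positives, usable as scale_pos_weight."""
    positives = sum(1 for label in labels if label == 1)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return (len(labels) - positives) / positives
```

With the 96.5% / 3.5% class distribution of this dataset, the ratio comes out at roughly 27.6, so each fraud example is weighted about as heavily as 27 legitimate ones.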

inference.py

  • Purpose: Demonstrate production inference patterns
  • Demonstrates:
    1. Batch inference (score multiple transactions)
    2. Real-time inference (single transaction)
    3. Latency measurements
    4. Production readiness assessment
  • Output:
    • Predictions
    • Latency statistics (mean, median, p95, p99)
    • Throughput estimates
  • Runtime: <1 second for 20 predictions
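
The latency statistics and throughput estimate listed above can be computed from a list of per-request timings. The sketch below is illustrative and not taken from inference.py: the function names are hypothetical, the percentile uses the nearest-rank definition (other definitions interpolate and give slightly different values), and throughput is estimated as the reciprocal of mean latency for a single thread.

```python
import math
import statistics
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(latencies_ms):
    """Mean, median, p95, p99, and single-thread throughput estimate."""
    mean = statistics.fmean(latencies_ms)
    return {
        "mean": mean,
        "median": statistics.median(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "throughput_per_s": 1000.0 / mean,
    }


def timed_ms(fn, *args):
    """Call fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0
```

Wrapping each prediction call in `timed_ms` and passing the collected timings to `summarize_latencies` yields the same kind of summary the script prints.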

🎯 Expected Results

Model Performance (Full Dataset)

ROC-AUC: 0.92-0.95
Average Precision: 0.70-0.80

At a 0.5 probability threshold:
  Precision: 0.85-0.90
  Recall: 0.75-0.85
  
Class distribution:
  Legitimate: 96.5%
  Fraud: 3.5%
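
Precision and recall depend on the threshold used to turn a fraud probability into a decision. The following sketch shows the calculation in plain Python; `precision_recall_at` is a hypothetical name, and train_model.py may well use a library routine instead.

```python
def precision_recall_at(y_true, y_prob, threshold=0.5):
    """Precision and recall when probabilities >= threshold count as fraud."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_fraud = prob >= threshold
        if predicted_fraud and label == 1:
            tp += 1
        elif predicted_fraud and label == 0:
            fp += 1
        elif not predicted_fraud and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Lowering the threshold generally raises recall at the cost of precision, which is why trying different thresholds is worthwhile with a 3.5% fraud rate.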

Inference Latency

Real-time inference (end-to-end):
  Mean: 12-18 ms
  Median: 10-15 ms
  P95: 20-30 ms
  P99: 30-50 ms

Breakdown:
  Feature serving (Redis): 8-15 ms (70-80%)
  Model prediction: 1-3 ms (20-30%)

Throughput (single thread):
  ~60-100 predictions/second

🔍 Monitoring Progress

Check Featureform Status

# Via CLI
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# Status will show:
# PENDING → RUNNING → READY (or FAILED)
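
Rather than rerunning the command by hand, the status check can be wrapped in a polling loop. The sketch below is deliberately generic: `wait_until_ready` is a hypothetical helper, and `get_status` is a caller-supplied function that returns one of the status strings above. How to extract the status from the CLI output is left to the caller, since the output format is not shown here.

```python
import time
from typing import Callable

TERMINAL_STATES = {"READY", "FAILED"}


def wait_until_ready(get_status: Callable[[], str], timeout: float = 600.0,
                     interval: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status until it returns READY or FAILED, or raise on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        sleep(interval)
```

The function returns FAILED instead of raising, so the caller can decide whether to inspect logs or reapply the definitions.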

Via Dashboard

open http://localhost

Navigate to:

  • Sources: See ieee_transaction, ieee_identity
  • Transformations: See card_aggregates, email_aggregates, etc.
  • Features: See all 8 features
  • Training Sets: See fraud_detection

Check Docker Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f featureform
docker-compose logs -f postgres
docker-compose logs -f redis

🐛 Troubleshooting

"Kaggle credentials not found"

# Set up Kaggle API
# 1. Go to https://www.kaggle.com/account
# 2. Click "Create New API Token"
# 3. Download kaggle.json
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

"Postgres connection failed"

# Check if services are running
docker-compose ps

# Restart if needed
docker-compose restart postgres

# Wait and try again
sleep 10
python verify.py

"Resources stuck in PENDING"

# Check Featureform logs
docker-compose logs featureform

# Common causes:
# - SQL syntax error in transformation
# - Column name mismatch
# - Dependency not ready

# Try reapplying
featureform apply definitions_v2.py --host localhost:7878 --insecure

"Training set not found"

# Make sure definitions were applied
featureform apply definitions_v2.py --host localhost:7878 --insecure

# Wait for READY status
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# Check dashboard
open http://localhost

"Model file not found"

# Make sure training completed successfully
ls -lh models/

# If missing, retrain
python train_model.py

πŸ”„ Resetting Everything

Soft reset (keep data)

docker-compose restart

Hard reset (remove all data)

docker-compose down -v
rm -rf data/ models/
./setup_ieee.sh

📈 Next Steps After Basic Setup

  1. Experiment with features

    • Edit definitions_v2.py
    • Add time-based features
    • Create interaction features
  2. Tune the model

    • Edit parameters in train_model.py
    • Try different thresholds
    • Implement cross-validation
  3. Production deployment

    • Deploy to Kubernetes
    • Add monitoring
    • Implement A/B testing
    • Set up feature refresh schedules
  4. Advanced features

    • Add streaming features
    • Implement feature importance tracking
    • Add model explainability (SHAP)
    • Create feature monitoring dashboards

📚 Key Files Reference

.
├── setup_ieee.sh              # Automated setup
├── download_dataset.py        # Download from Kaggle
├── load_ieee_data.py          # Load into Postgres
├── definitions_v2.py          # Featureform definitions
├── train_model.py             # Train XGBoost
├── inference.py               # Run inference
├── docker-compose.yml         # Infrastructure
├── README_IEEE.md             # Full documentation
└── QUICK_REFERENCE.md         # This file

💡 Pro Tips

  1. Start with sample data for faster iteration
  2. Use the dashboard to visualize dependencies
  3. Check logs when things don't work
  4. Monitor latencies to understand bottlenecks
  5. Version your features using variants

πŸŽ“ Understanding the Output

Training Output

[Data Loading] Completed in 15.23 seconds
✓ Loaded 47,237 training examples

[Model Training] Completed in 45.67 seconds
✓ Model trained successfully
  Best iteration: 87
  Best score: 0.9456

ROC-AUC Score: 0.9234
Average Precision Score: 0.7891

Inference Output

✓ C121424: FRAUD (prob=0.892) [12.3ms total, 9.1ms serving]
✓ C543210: LEGIT (prob=0.034) [11.8ms total, 8.9ms serving]

Latency Statistics:
  Mean: 12.45 ms
  Median: 11.50 ms
  P95: 15.23 ms
  P99: 18.91 ms

✓ Excellent - Suitable for real-time transaction processing

🔗 Useful Links


Need help? Check the full documentation in README_IEEE.md
