Get a Kaggle API key from the Kaggle website, then:

```bash
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```

Running the automated setup (`./setup_ieee.sh`) will:
- Create Python 3.9 virtual environment
- Activate the environment
- Install all dependencies from `requirements.txt`
```bash
# Create virtual environment
python3.9 -m venv venv

# Activate
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Verify the environment:

```bash
python --version  # Should show Python 3.9.x
pip list          # Should show all installed packages
```

Run the full pipeline:

```bash
# 1. Start services
docker-compose up -d && sleep 30

# 2. Download dataset (requires Kaggle API setup)
python download_dataset.py

# 3. Load data (choose sample or full)
python load_ieee_data.py

# 4. Apply Featureform definitions
featureform apply definitions_v2.py --host localhost:7878 --insecure

# 5. Wait for READY (check dashboard: http://localhost)
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# 6. Train model
python train_model.py

# 7. Run inference
python inference.py
```

**Sample dataset:**

| Step | Time | Notes |
|---|---|---|
| Docker startup | 30s | One-time |
| Dataset download | 2-3 min | One-time, ~500MB |
| Data loading | 30s | Into Postgres |
| Feature materialization | 30-60s | Featureform processing |
| Model training | 30-60s | XGBoost training |
| Inference (20 predictions) | <1s | Real-time serving |
Total first run: ~6-8 minutes
Subsequent runs (data cached): ~3-4 minutes
**Full dataset:**

| Step | Time | Notes |
|---|---|---|
| Docker startup | 30s | One-time |
| Dataset download | 2-3 min | One-time, ~500MB |
| Data loading | 3-5 min | Into Postgres |
| Feature materialization | 2-5 min | Featureform processing |
| Model training | 3-5 min | XGBoost training |
| Inference (20 predictions) | <1s | Real-time serving |
Total first run: ~15-20 minutes
Subsequent runs (data cached): ~10-15 minutes
- Purpose: Download IEEE-CIS dataset from Kaggle
- Requirements: Kaggle API credentials at `~/.kaggle/kaggle.json`
- Output: CSVs in the `data/` directory (~500MB)
- Runtime: 2-3 minutes
- Purpose: Load CSV data into Postgres tables
- Interactive: Asks for sample (50K rows) or full (590K rows)
- Processing:
  - Creates `ieee_transaction` table
  - Creates `ieee_identity` table
  - Creates indexes
  - Shows data statistics
- Runtime: 30 seconds (sample) or 3-5 minutes (full)
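The loading flow above can be sketched roughly as follows. This is an illustrative sketch, not the script's actual code: it uses `sqlite3` as a stand-in for Postgres so the example is self-contained, and the column set is abbreviated (the real IEEE-CIS tables have hundreds of columns).

```python
# Hypothetical sketch of the load_ieee_data.py flow, using sqlite3
# as a stand-in for Postgres so it runs without any services.
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create the transaction table (column set abbreviated for illustration)
cur.execute("""
    CREATE TABLE ieee_transaction (
        transaction_id INTEGER PRIMARY KEY,
        is_fraud INTEGER,
        transaction_amt REAL,
        card1 INTEGER
    )
""")
# Index an aggregation key, mirroring the script's "creates indexes" step
cur.execute("CREATE INDEX idx_card1 ON ieee_transaction (card1)")

# Inline sample standing in for data/train_transaction.csv
sample_csv = io.StringIO(
    "TransactionID,isFraud,TransactionAmt,card1\n"
    "2987000,0,68.5,13926\n"
    "2987001,1,29.0,2755\n"
)
rows = [
    (int(r["TransactionID"]), int(r["isFraud"]),
     float(r["TransactionAmt"]), int(r["card1"]))
    for r in csv.DictReader(sample_csv)
]
cur.executemany("INSERT INTO ieee_transaction VALUES (?, ?, ?, ?)", rows)

# Show data statistics, as the script does after loading
total = cur.execute("SELECT COUNT(*) FROM ieee_transaction").fetchone()[0]
fraud = cur.execute(
    "SELECT COUNT(*) FROM ieee_transaction WHERE is_fraud = 1"
).fetchone()[0]
print(f"Loaded {total} rows, {fraud} fraud ({fraud / total:.1%})")
```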
- Purpose: Define feature engineering pipeline in Featureform
- Creates:
  - 2 source tables
  - 4 SQL transformations
  - 8 features
  - 1 training set
- Not run directly: Applied with `featureform apply`
- Purpose: Train XGBoost fraud detection model
- Process:
  - Loads training set from Featureform
  - Splits into train/validation (80/20)
  - Trains XGBoost with class balancing
  - Evaluates with fraud-specific metrics
  - Saves model to `models/`
- Output:
  - Trained model file
  - Performance metrics
  - Feature importance
- Runtime: 30 seconds (sample) or 3-5 minutes (full)
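The split and class-balancing steps can be illustrated with a toy sketch. The labels and numbers below are made up, and `scale_pos_weight` (negatives divided by positives) is XGBoost's standard knob for imbalanced classes; the script's exact parameters may differ.

```python
# Illustrative sketch of the 80/20 split and class-balancing setup;
# not the actual train_model.py code.
import random

random.seed(0)
# Toy labels mirroring the dataset's ~3.5% fraud rate
labels = [1 if random.random() < 0.035 else 0 for _ in range(10_000)]

# 80/20 train/validation split
split = int(len(labels) * 0.8)
train, valid = labels[:split], labels[split:]

# XGBoost's usual class-balancing heuristic: negatives / positives
neg, pos = train.count(0), train.count(1)
scale_pos_weight = neg / pos
print(f"train={len(train)} valid={len(valid)} "
      f"scale_pos_weight={scale_pos_weight:.1f}")
```

This weight would then be passed to the booster (e.g. `XGBClassifier(scale_pos_weight=...)`) so fraud examples are not drowned out by the legitimate majority.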
- Purpose: Demonstrate production inference patterns
- Demonstrates:
  - Batch inference (score multiple transactions)
  - Real-time inference (single transaction)
  - Latency measurements
  - Production readiness assessment
- Output:
  - Predictions
  - Latency statistics (mean, median, p95, p99)
  - Throughput estimates
- Runtime: <1 second for 20 predictions
```
ROC-AUC: 0.92-0.95
Average Precision: 0.70-0.80

At 50% threshold:
  Precision: 0.85-0.90
  Recall: 0.75-0.85

Class distribution:
  Legitimate: 96.5%
  Fraud: 3.5%
```
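The threshold metrics above follow the usual definitions; here is a tiny worked example with made-up confusion counts, just to show how the precision and recall figures are derived:

```python
# Toy confusion counts at a 50% decision threshold (illustrative only)
true_pos, false_pos, false_neg = 170, 25, 40

precision = true_pos / (true_pos + false_pos)  # of flagged, how many were fraud
recall = true_pos / (true_pos + false_neg)     # of fraud, how many were caught
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")
```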
```
Real-time inference (end-to-end):
  Mean: 12-18 ms
  Median: 10-15 ms
  P95: 20-30 ms
  P99: 30-50 ms

Breakdown:
  Feature serving (Redis): 8-15 ms (70-80%)
  Model prediction: 1-3 ms (20-30%)

Throughput (single thread):
  ~60-100 predictions/second
```
```bash
# Via CLI
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# Status will show:
# PENDING → RUNNING → READY (or FAILED)
```

Or via the dashboard:

```bash
open http://localhost
```

Navigate to:
- Sources: See `ieee_transaction`, `ieee_identity`
- Transformations: See `card_aggregates`, `email_aggregates`, etc.
- Features: See all 8 features
- Training Sets: See fraud_detection
```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f featureform
docker-compose logs -f postgres
docker-compose logs -f redis
```

Kaggle API errors:

```bash
# Setup Kaggle API
# 1. Go to https://www.kaggle.com/account
# 2. Click "Create New API Token"
# 3. Download kaggle.json
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Connection errors:

```bash
# Check if services are running
docker-compose ps

# Restart if needed
docker-compose restart postgres

# Wait and try again
sleep 10
python verify.py
```

Transformations stuck or FAILED:

```bash
# Check Featureform logs
docker-compose logs featureform

# Common causes:
# - SQL syntax error in transformation
# - Column name mismatch
# - Dependency not ready

# Try reapplying
featureform apply definitions_v2.py --host localhost:7878 --insecure
```

Training set not found:

```bash
# Make sure definitions were applied
featureform apply definitions_v2.py --host localhost:7878 --insecure

# Wait for READY status
featureform get training-set fraud_detection v2 --host localhost:7878 --insecure

# Check dashboard
open http://localhost
```

Model file missing:

```bash
# Make sure training completed successfully
ls -lh models/

# If missing, retrain
python train_model.py
```

Restart all services:

```bash
docker-compose restart
```

Full reset (wipes volumes, data, and models):

```bash
docker-compose down -v
rm -rf data/ models/
./setup_ieee.sh
```

- **Experiment with features**
  - Edit `definitions_v2.py`
  - Add time-based features
  - Create interaction features

- **Tune the model**
  - Edit parameters in `train_model.py`
  - Try different thresholds
  - Implement cross-validation

- **Production deployment**
  - Deploy to Kubernetes
  - Add monitoring
  - Implement A/B testing
  - Setup feature refresh schedules

- **Advanced features**
  - Add streaming features
  - Implement feature importance tracking
  - Add model explainability (SHAP)
  - Create feature monitoring dashboards
```
.
├── setup_ieee.sh        # Automated setup
├── download_dataset.py  # Download from Kaggle
├── load_ieee_data.py    # Load into Postgres
├── definitions_v2.py    # Featureform definitions
├── train_model.py       # Train XGBoost
├── inference.py         # Run inference
├── docker-compose.yml   # Infrastructure
├── README_IEEE.md       # Full documentation
└── QUICK_REFERENCE.md   # This file
```
- Start with sample data for faster iteration
- Use the dashboard to visualize dependencies
- Check logs when things don't work
- Monitor latencies to understand bottlenecks
- Version your features using variants
```
[Data Loading] Completed in 15.23 seconds
✓ Loaded 47,237 training examples

[Model Training] Completed in 45.67 seconds
✓ Model trained successfully
  Best iteration: 87
  Best score: 0.9456

ROC-AUC Score: 0.9234
Average Precision Score: 0.7891
```

```
⚠ C121424: FRAUD (prob=0.892) [12.3ms total, 9.1ms serving]
✓ C543210: LEGIT (prob=0.034) [11.8ms total, 8.9ms serving]

Latency Statistics:
  Mean: 12.45 ms
  Median: 11.50 ms
  P95: 15.23 ms
  P99: 18.91 ms

✓ Excellent - Suitable for real-time transaction processing
```
Need help? Check the full documentation in README_IEEE.md