This project presents a full-cycle Credit Risk Modeling solution to predict the likelihood of a borrower defaulting on a loan. It involves meticulous data cleaning, feature engineering, model training, business-aligned metric optimization, and deployment using Streamlit. Designed with real-world financial services impact in mind, the model prioritizes recall to minimize false negatives (i.e., not catching risky borrowers).
- Goal: Predict whether a borrower will default on a loan.
- Dataset: Provided by a financial institution with borrower-level and loan-level details.
- Target Variable: `default` (1 = default, 0 = not default)
- Business Objective: High recall for defaulters to minimize risk exposure.
- Deployment: Web app hosted using Streamlit Cloud.
The dataset was highly imbalanced:
- Techniques used: SMOTE-Tomek, oversampling, and threshold tuning.
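Threshold tuning, one of the techniques listed above, can be illustrated with a minimal sketch. It assumes the model already outputs default probabilities and picks the highest cutoff that still meets a recall target. The function names and the toy scores are hypothetical and not taken from the project code.

```python
def recall_at(y_true, y_prob, threshold):
    """Recall for the positive class (1 = default) at a given cutoff."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def highest_threshold_for_recall(y_true, y_prob, target_recall):
    """Return the largest cutoff whose recall meets the target.

    Candidate cutoffs are the predicted probabilities themselves. Recall can
    only grow as the cutoff drops, so the first hit in descending order wins.
    """
    for threshold in sorted(set(y_prob), reverse=True):
        if recall_at(y_true, y_prob, threshold) >= target_recall:
            return threshold
    return 0.0


# Toy, made-up scores for illustration only (not project data).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.60, 0.70, 0.90]

cutoff = highest_threshold_for_recall(y_true, y_prob, target_recall=0.95)
print("cutoff:", cutoff)
print("recall at cutoff:", recall_at(y_true, y_prob, cutoff))
print("recall at default 0.5:", recall_at(y_true, y_prob, 0.5))
```

Lowering the cutoff below the default 0.5 trades some precision and accuracy for recall, which matches the stated business objective of catching more defaulters.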
Data leakage was handled by removing leak-prone features such as `disbursal_date` and `installment_start_dt`, along with derived leakage indicators.
Boxplots revealed records where `processing_fee` exceeded `loan_amount`, which is invalid. These anomalies were cleaned or capped appropriately.
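A check for this kind of anomaly might look like the sketch below. It works on plain dictionaries to stay self-contained. The 3% fee cap is a hypothetical business rule chosen for illustration; the source does not state which cap, if any, the project applied.

```python
MAX_FEE_RATIO = 0.03  # hypothetical cap (assumption), not taken from the project


def split_invalid_fees(rows):
    """Separate rows whose processing fee exceeds the loan amount."""
    valid, invalid = [], []
    for row in rows:
        if row["processing_fee"] > row["loan_amount"]:
            invalid.append(row)
        else:
            valid.append(row)
    return valid, invalid


def cap_processing_fee(row, max_ratio=MAX_FEE_RATIO):
    """Return a copy of the row with the fee capped at max_ratio * loan_amount."""
    capped = dict(row)
    capped["processing_fee"] = min(
        row["processing_fee"], max_ratio * row["loan_amount"]
    )
    return capped


# Made-up records for illustration only.
rows = [
    {"loan_amount": 100_000, "processing_fee": 2_000},
    {"loan_amount": 50_000, "processing_fee": 75_000},  # invalid: fee > loan
]

valid, invalid = split_invalid_fees(rows)
capped = [cap_processing_fee(r) for r in invalid]
print(len(valid), "valid row(s);", len(invalid), "invalid row(s)")
print("capped fee:", capped[0]["processing_fee"])
```

Whether to drop or cap such rows is a judgment call; capping keeps the record for training while removing the impossible value.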
- `loan_purpose` cleaned and grouped into standard categories.
- One-hot encoding and WoE/IV analysis used for feature transformation and selection.
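For readers unfamiliar with WoE/IV, the following is a minimal sketch of how Weight of Evidence and Information Value can be computed for one categorical feature. It assumes the convention WoE = ln(share of non-defaulters / share of defaulters); some references use the inverse ratio. The `smoothing` parameter and the toy data are illustrative assumptions, not part of the project.

```python
import math
from collections import Counter


def woe_iv(categories, targets, smoothing=0.5):
    """Return ({category: WoE}, IV) for one categorical feature.

    targets: 1 = default, 0 = not default.
    smoothing: small count added to every cell so empty cells do not cause
    a division by zero or log(0) (an assumption of this sketch).
    """
    goods = Counter(c for c, t in zip(categories, targets) if t == 0)
    bads = Counter(c for c, t in zip(categories, targets) if t == 1)
    levels = sorted(set(categories))
    total_good = sum(goods.values()) + smoothing * len(levels)
    total_bad = sum(bads.values()) + smoothing * len(levels)

    woe, iv = {}, 0.0
    for level in levels:
        good_share = (goods[level] + smoothing) / total_good
        bad_share = (bads[level] + smoothing) / total_bad
        woe[level] = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe[level]
    return woe, iv


# Made-up data: "home" loans mostly repay, "personal" loans default more often.
purpose = ["home"] * 10 + ["personal"] * 10
default = [0] * 8 + [1] * 2 + [0] * 4 + [1] * 6

woe, iv = woe_iv(purpose, default)
print(woe)
print("IV:", round(iv, 3))
```

Under this convention a negative WoE marks a category with a higher-than-average share of defaulters, and a larger IV indicates a feature that separates the two classes better.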
- Loan-to-Income Ratio (LTI): `loan_amount / income`
- Delinquency Ratio
- Average DPD per Delinquency
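The engineered features above can be sketched as follows. Only the LTI formula is stated in this README. The other two formulas, and the input fields `delinquent_months`, `total_loan_months`, and `total_dpd`, are assumptions made for illustration and may differ from the project's actual definitions.

```python
def engineer_features(row):
    """Return the three ratio features for one loan record (a dict).

    Assumed definitions (only loan_to_income is given in the README):
      loan_to_income          = loan_amount / income
      delinquency_ratio       = delinquent_months / total_loan_months
      avg_dpd_per_delinquency = total_dpd / delinquent_months
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    income = row["income"]
    total_months = row["total_loan_months"]
    delinquent = row["delinquent_months"]
    return {
        "loan_to_income": row["loan_amount"] / income if income else 0.0,
        "delinquency_ratio": delinquent / total_months if total_months else 0.0,
        "avg_dpd_per_delinquency": row["total_dpd"] / delinquent if delinquent else 0.0,
    }


# Made-up record for illustration only.
record = {
    "loan_amount": 500_000,
    "income": 250_000,
    "total_loan_months": 12,
    "delinquent_months": 3,
    "total_dpd": 45,
}

features = engineer_features(record)
print(features)
```

Guarding each denominator avoids a `ZeroDivisionError` for borrowers with no delinquent months, a common case in practice.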
- High LTI, delinquency_ratio, and avg_dpd_per_delinquency were strong predictors of default.
- Customers who defaulted tended to be younger, with longer loan tenures and higher credit utilization.
Dropped correlated features: sanction_amount, processing_fee, gst, net_disbursement, principal_outstanding.
Top features:
`credit_utilization_ratio`, `avg_dpd_per_delinquency`, `loan_to_income`, `loan_purpose`, `residence_type`, `loan_tenure_months`, `loan_type`, `age`, etc.
| Model | Accuracy | Recall (Defaulters) |
|---|---|---|
| Logistic Regression (Basic) | 96% | 0.70 |
| Random Forest | 96% | 0.69 |
| XGBoost | 96% | 0.75 |
- Logistic Regression
- SMOTE-Tomek
- Optuna for Hyperparameter Tuning
- Business chose Logistic Regression for explainability
- Accuracy: 93%
- Recall (Defaulters): 0.95
- AUC: 98.3%
- Gini Coefficient: 0.967
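The AUC and Gini figures above are linked by the standard identity Gini = 2 × AUC − 1, so an AUC of 0.983 corresponds to a Gini of about 0.966, in line with the reported 0.967 up to rounding. A minimal sketch of both calculations, using made-up scores and hypothetical function names:

```python
def auc_score(y_true, y_prob):
    """AUC as the probability that a randomly chosen defaulter receives a
    higher score than a randomly chosen non-defaulter (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini coefficient of a scoring model from its AUC."""
    return 2 * auc - 1


# Toy scores for illustration only (not project data).
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.30, 0.35, 0.40, 0.80, 0.90]

auc = auc_score(y_true, y_prob)
print("toy AUC:", round(auc, 3), "toy Gini:", round(gini_from_auc(auc), 3))
print("Gini for AUC 0.983:", round(gini_from_auc(0.983), 3))
```

The pairwise comparison is quadratic in the number of records, which is fine for a sketch but too slow for a full validation set; a rank-based implementation would be the usual choice there.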
- App Framework: Streamlit
- Main Files: `main.py`, `prediction_helper.py`
- Hosting: Streamlit Cloud
- Enables better credit risk filtering.
- High recall helps reduce bad debt.
- Easy model interpretability aids compliance and auditing.
Advance_Credit_Risk_Model_Loan_prediction/
├── data/
├── notebooks/
├── main.py
├── prediction_helper.py
├── README.md
├── requirements.txt
├── images/
│   ├── ks_statistic.png
│   ├── roc_curve.png
│   ├── confusion_matrix.png
│   ├── streamlit_app_screenshot.png
│   ├── metrics.png
│   └── feature_importance.png
└── artifacts/
    └── modeldata.joblib
- Mehul Ligade
- GitHub: @mehulcode12
- CodeBasics
This project was completed as part of the Codebasics Data Science Bootcamp. Special thanks to mentors and the open-source community for libraries and frameworks.
You are welcome to use this project as a reference. Please give credit to CodeBasics if you find it helpful.




