GitHub - WadhwaniAI/StudentDropoutEWS: Official repository of the Early Warning System in Education to mitigate school dropouts developed by Wadhwani AI in collaboration with VSK, Gujarat.

🗂️ OVERVIEW

Problem

Many students in India drop out of school due to diverse social, economic, and geographical factors.
A student who is enrolled in a given academic year (AY) but does not re-enroll in the next AY is counted as a dropout.
Education gaps lead to unskilled labour and are linked to poor health, impeding a nation's development.

Motivation

Dropout indicators are present in social traits, attendance patterns, and performance in assessments.
This project aims to build an Early Warning System (EWS) using machine learning (ML) techniques to predict students who are at risk of dropping out of school.
These predictions could be used to identify at-risk students early and to design interventions that mitigate student dropouts.

Data Sources

The project has been developed using the following data that was provided by Vidya Samiksha Kendra (VSK)—Samagra Shiksha, Department of Education, state of Gujarat, India.
- Enrollment data: Delineates a student's regional and socioeconomic factors.
- Daily attendance data: Delineates a student's daily attendance (present, absent, or missing entry).
- Semester assessment data: Delineates a student's attendance and performance in examinations.
Data from the three sources is merged into a unified dataset with each row representing information pertaining to one student.
Customizable: Anyone with similar data could use this project by suitably modifying the Dataset Schema.

Formulation

EWS is formulated as a Binary Classification ML problem (dropout: label 1, not-dropout: label 0).
For a given AY, a binary Target for each student is derived using the enrollment data of the following AY.
The Input to the pipeline is the unified dataset (with the target column).
The resulting Output is a dataframe that includes the final set of features used in modeling and dropout probabilities for each student.
SHAP is used to explain the model's predictions.
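
The target derivation described above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes each AY's enrollment is available as a collection of student IDs, and the function name `derive_dropout_targets` and the sample IDs are hypothetical.

```python
def derive_dropout_targets(enrolled_this_ay, enrolled_next_ay):
    """Label each student enrolled this AY: 1 = dropout, 0 = not-dropout.

    A student is labeled a dropout when their ID is absent from the
    following AY's enrollment.
    """
    next_ids = set(enrolled_next_ay)
    return {sid: int(sid not in next_ids) for sid in enrolled_this_ay}


# Hypothetical student IDs for two consecutive AYs.
ay2223 = ["S001", "S002", "S003"]
ay2324 = ["S001", "S003", "S004"]  # S004 is new in 2023-24, so it gets no 2022-23 label

targets = derive_dropout_targets(ay2223, ay2324)
print(targets)  # {'S001': 0, 'S002': 1, 'S003': 0}
```

Only students present in the earlier AY receive a label, which matches the formulation: the target for a given AY is derived from the enrollment data of the following AY.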

🔧 SETUP

Clone the repository

git clone https://github.com/WadhwaniAI/StudentDropoutEWS.git
cd StudentDropoutEWS
git checkout main

Create a virtual environment and install the required packages

conda create --name ews python==3.12
conda activate ews
pip install -r requirements.txt

🧩 METADATA

The metadata directory contains various schemas necessary to use this repository.

Calendar of holidays

This is a mandatory nested JSON dictionary that holds information about holidays in AYs.
Example format: {"2223": {"6": {"sundays": [5, 12, 19, 26], "vacation": [1, 2]}}}
- Stores non-working dates for each AY (e.g., "2223" for AY 2022-23) and month (e.g., "6" for June, and "7" for July).
- Dates are integers under categories like "sundays", "festive", "vacation", or others (e.g., "pravesh utsav").
An example of this file for the AYs from 2022-23 to 2024-25 for the state of Gujarat is here.
Please edit the dictionary within this file for the AYs of your interest.
This file could either be manually populated from a PDF or parsed from a CSV of holidays for any AY.
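
As an illustration of how the nested structure can be read, the sketch below loads the example format shown above and pools the non-working dates for one AY and month. The helper name `non_working_dates` is hypothetical and is not taken from the repository.

```python
import json

# Same structure as the example format above.
calendar_json = '{"2223": {"6": {"sundays": [5, 12, 19, 26], "vacation": [1, 2]}}}'
holidays = json.loads(calendar_json)


def non_working_dates(holidays, ay, month):
    """Return the sorted non-working dates for one AY and month.

    Dates from every category ("sundays", "vacation", ...) are pooled.
    An AY or month that is not in the calendar yields an empty list.
    """
    categories = holidays.get(str(ay), {}).get(str(month), {})
    dates = set()
    for day_list in categories.values():
        dates.update(day_list)
    return sorted(dates)


print(non_working_dates(holidays, "2223", 6))  # [1, 2, 5, 12, 19, 26]
```

Because JSON object keys are always strings, the AY and month are converted with `str()` before the lookup, so callers may pass the month as either `6` or `"6"`.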

Attendance Replacement Map

This is a mandatory JSON file that defines the semantic mapping for raw attendance values.
It is used to convert different raw data values (e.g., "1", "2", "nan") into the standardized representations for "present", "absent", and "missing".
The keys in this file must be "present", "absent", and "missing". The validation logic strictly checks for these keys, which correspond to the constants.Attendance.Status class in constants.py, ensuring consistency.
Please edit these constants for your dataset appropriately.
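
A minimal sketch of how such a map could be validated and applied is shown below. It assumes each key maps to a list of raw values, and the assignment of the raw codes "1", "2", and "nan" to statuses is purely illustrative. The function `build_lookup` is hypothetical and does not reproduce the validation logic in the repository.

```python
REQUIRED_KEYS = {"present", "absent", "missing"}

# Hypothetical map: which raw codes belong to which status is an assumption.
replacement_map = {
    "present": ["1"],
    "absent": ["2"],
    "missing": ["nan"],
}


def build_lookup(replacement_map):
    """Check the keys, then invert the map to raw value -> status."""
    if set(replacement_map) != REQUIRED_KEYS:
        raise ValueError(f"keys must be exactly {sorted(REQUIRED_KEYS)}")
    return {
        raw: status
        for status, raw_values in replacement_map.items()
        for raw in raw_values
    }


lookup = build_lookup(replacement_map)
raw_attendance = ["1", "1", "2", "nan", "1"]
standardized = [lookup[value] for value in raw_attendance]
print(standardized)  # ['present', 'present', 'absent', 'missing', 'present']
```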

Predictor groups

This is an optional JSON dictionary logically categorizing similar features into predictor groups.
- They are not required for the training or inference pipelines.
- They are required only for explainability.
Predictor groups are used to explain a model's predictions and guide interventions.
Features may be manually organized into predictor groups depending on which aspects are most relevant for explanation.
- The features in Predictor groups must be a subset of the features used in modeling.
- We use programmatic inputs and guidance from VSK and UNICEF India to aggregate features into logical groups.
  - For example, features representing the location are grouped as "geographical_factors", and features representing attendance are grouped as "attendance_factors".
  - Note: Different strategies could be adopted to implement this aggregation and modify Predictor groups based on your unique needs.
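
To make the subset requirement and the grouping idea concrete, here is a small sketch. The feature names, group contents, and SHAP values are hypothetical. Summing per-feature SHAP values within a group is one common aggregation and is an assumption here, not a description of the repository's method.

```python
# Hypothetical predictor groups and modeling features.
predictor_groups = {
    "geographical_factors": ["district", "block"],
    "attendance_factors": ["absent_days_june", "absent_days_july"],
}
model_features = [
    "district", "block", "absent_days_june", "absent_days_july", "school_type",
]


def validate_groups(groups, features):
    """Raise if any grouped feature is not one of the modeling features."""
    known = set(features)
    for group, members in groups.items():
        unknown = sorted(set(members) - known)
        if unknown:
            raise ValueError(f"{group} has features not used in modeling: {unknown}")


def group_contributions(shap_row, groups):
    """Sum one student's per-feature SHAP values within each group."""
    return {g: sum(shap_row[f] for f in members) for g, members in groups.items()}


validate_groups(predictor_groups, model_features)

# Hypothetical per-feature SHAP values for a single prediction.
shap_row = {
    "district": 0.25, "block": 0.5,
    "absent_days_june": -0.125, "absent_days_july": 0.5,
    "school_type": 0.0625,
}
contributions = group_contributions(shap_row, predictor_groups)
print(contributions)  # {'geographical_factors': 0.75, 'attendance_factors': 0.375}
```

Features left out of every group (here `school_type`) simply do not contribute to any group total, which is consistent with the groups being a subset of the modeling features.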

📊 DATASET

A valid dataset for training and inference must satisfy the following requirements:

1. File Format

Dataset files must be in the pickle (.pkl) format. Other formats are not currently supported.

2. Naming Convention

The filename (excluding extension) must follow the pattern ay<academic_year>_grade<grade>.
This convention is used to extract metadata like the academic year and grade.
For example: dataset/ay2223_grade3.pkl.
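
The naming convention can be checked with a short regular expression. The sketch below assumes the academic year is a four-digit code (as in "2223") and the grade is an integer. The function `parse_dataset_name` is hypothetical, not the repository's parser.

```python
import re
from pathlib import Path

# Assumed shape: ay<4 digits>_grade<integer>, e.g. ay2223_grade3
FILENAME_PATTERN = re.compile(r"^ay(?P<academic_year>\d{4})_grade(?P<grade>\d+)$")


def parse_dataset_name(path):
    """Extract (academic_year, grade) from a dataset file path."""
    match = FILENAME_PATTERN.match(Path(path).stem)
    if match is None:
        raise ValueError(f"filename does not follow ay<academic_year>_grade<grade>: {path}")
    return match["academic_year"], int(match["grade"])


print(parse_dataset_name("dataset/ay2223_grade3.pkl"))  # ('2223', 3)
```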

3. Schema Conformance

The dataset must conform to the structure defined in the Dataset Schema.
This is a mandatory JSON dictionary (explained here) that defines the structure of a usable dataset.
The columns in your dataset must be a subset of those defined in the schema.
If your dataset uses different column names, you must update the Dataset Schema accordingly.
An illustrative sample dataset is shown here.
- Disclaimer: This sample dataset contains synthetically generated data for demonstration purposes. Any resemblance to real individuals or entities is purely coincidental.
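
A quick way to check the subset requirement before running the pipeline is sketched below. The schema is assumed to be a dictionary keyed by column name, which may not match the real Dataset Schema. All column names and the helper `columns_outside_schema` are hypothetical.

```python
# Hypothetical schema keyed by column name; the real Dataset Schema may differ.
schema = {
    "student_id": "str",
    "district": "str",
    "attendance_june": "list",
    "target": "int",
}


def columns_outside_schema(columns, schema):
    """Return dataset columns that the schema does not define."""
    allowed = set(schema)
    return sorted(col for col in columns if col not in allowed)


dataset_columns = ["student_id", "district", "school_medium"]
extra = columns_outside_schema(dataset_columns, schema)
print(extra)  # ['school_medium']
```

An empty result means the dataset's columns are a subset of the schema. Any column listed must either be dropped from the dataset or added to the schema.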

⚙️ CONFIGURATION

A Config file is required to run the training or inference pipelines.
This file must conform to the structure defined in the Config Schema.
The schema file itself should not be edited. Instead, you should create a copy to use for your experiments.
For a detailed explanation of all the parameters, please see the Configuration README.

📉 TRAINING

To train a model, run main.py in train mode:

python -m src.main \
     --mode train \
     --config_source path/to/config_or_config_dir

Arguments:
----------
mode (str): Must be set to "train" to activate TRAINING mode.
config_source (str): Path to either a single config file or a directory containing multiple config JSONs.

Training generates the following artifacts in the experiment directory (created using config.exp.root_exps):
- Training and validation dataframes with dropout predictions.
- Metric plots.
- A JSON file containing loss values over epochs.
If a directory of JSON configs is provided, experiments run in a loop.
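
The file-or-directory behaviour of config_source could be handled along the following lines. This is a sketch under the assumption that configs in a directory are identified by a `.json` extension. `resolve_config_paths` is a hypothetical helper and does not show how src.main actually resolves its argument.

```python
import tempfile
from pathlib import Path


def resolve_config_paths(config_source):
    """Return a list of config files for a file path or a directory path."""
    source = Path(config_source)
    if source.is_dir():
        # Sorted so that experiments run in a reproducible order.
        return sorted(source.glob("*.json"))
    return [source]


# Demonstration with a temporary directory holding two configs and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("exp_b.json", "exp_a.json", "notes.txt"):
        (Path(tmp) / name).write_text("{}")
    found = [p.name for p in resolve_config_paths(tmp)]

print(found)  # ['exp_a.json', 'exp_b.json']
```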

🎯 INFERENCE

To run inference on a new dataset with a trained model, run main.py in infer mode:

python -m src.main \
     --mode infer \
     --exp_dir path/to/exp_dir \
     --inference_data_path path/to/inference_data.pkl

Arguments:
----------
mode (str): Must be set to "infer" to activate INFERENCE mode.
exp_dir (str): Path to a previous experiment directory (to use trained model and config).  
inference_data_path (str): Path to the inference dataset file in pickle format.

Inference generates and saves a dataframe with features and predicted probabilities in exp_dir.

💡 EXPLAINABILITY

To explain results, run main.py in explain mode:

python -m src.main \
    --mode explain \
    --exp_dir path/to/exp_dir \
    --df_path path/to/input_data.pkl \
    --predictor_groups path/to/predictor_groups.json \
    [--threshold 0.6] \
    [--target_recall 0.4]

Arguments:
----------
mode (str): Must be set to "explain" to activate explainability mode.
exp_dir (str): Path to the experiment directory with trained model, config and optional artifacts.
df_path (str): Path to the results dataset (containing prediction columns) to be explained (`.pkl` format).
predictor_groups (str OR Dict[str, List[str]]): Path to the JSON file containing mapping of features to groups OR the loaded dictionary.
threshold (float): (Optional) Manually specify the threshold for binary classification to generate output predictions.
target_recall (float): (Optional) Target recall on the validation set, used to compute the threshold when one is not provided.

The resulting output of this pipeline is a dataframe saved in exp_dir with SHAP values for each predictor group and top driving feature(s) for each prediction.
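
To illustrate how a threshold can follow from target_recall, the sketch below picks the highest threshold at which recall on a labeled validation set is at least the target. It is one straightforward approach and is not claimed to be the repository's implementation. The function name and the sample labels and scores are hypothetical.

```python
import math


def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest threshold whose recall is at least target_recall.

    A student is predicted as a dropout when score >= threshold.
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    positive_scores = sorted(
        (s for y, s in zip(y_true, scores) if y == 1), reverse=True
    )
    if not positive_scores:
        raise ValueError("no positive (dropout) labels in the validation set")
    # Number of true dropouts that must be captured; the small epsilon
    # guards against floating-point error in the product.
    needed = max(1, math.ceil(target_recall * len(positive_scores) - 1e-9))
    return positive_scores[needed - 1]


# Hypothetical validation labels (1 = dropout) and predicted probabilities.
val_labels = [1, 0, 1, 1, 0, 1]
val_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

threshold = threshold_for_recall(val_labels, val_scores, target_recall=0.5)
print(threshold)  # 0.7
```

With four true dropouts and a target recall of 0.5, two must be captured. The second-highest dropout score is 0.7, so predicting dropout for scores of at least 0.7 recovers exactly half of them.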

🙏 ACKNOWLEDGEMENTS

We acknowledge with gratitude the collaborative partnership that has made EWS possible. This project is a pioneering initiative between Vidya Samiksha Kendra (VSK)-Gujarat, Wadhwani Institute for Artificial Intelligence (Wadhwani AI), and UNICEF to harness Machine Learning to mitigate school dropouts. We express our sincere appreciation for VSK, specifically the MIS Department, for providing comprehensive student data and program support, without which this transformative project would not have been realized. EWS demonstrates the power of collaborative innovation in education, uniting government institutions, technology leaders, and program partners to create a meaningful impact for Gujarat's children.

🔔 NOTICE

As things stand now, this repository will NOT be maintained.
However, if you are interested in collaborating, please reach out to us at education@wadhwaniai.org. We would be happy to discuss and explore potential opportunities.

🛡️ LICENSE

This project is licensed under the Apache License 2.0 © 2025 Wadhwani Institute for Artificial Intelligence (Wadhwani AI).
