This project implements a human pose classifier using Vision Transformers (ViT), covering all required steps from data preprocessing to real-world inference.
The best-performing model achieves the following classification metrics on the test set:
| Metric | Value | 
|---|---|
| Accuracy | 0.791 | 
| Precision (weighted) | 0.792 | 
| Recall (weighted) | 0.791 | 
| F1-Score (weighted) | 0.790 | 
The model can be deployed on an Amazon EC2 instance, and a live prototype is accessible via Streamlit Community Cloud: https://cv-human-pose-classifier-vit-aws.streamlit.app/
- Data Source: Human action images are obtained from the `Bingsu/Human_Action_Recognition` dataset, loaded and preprocessed with the Hugging Face `datasets` library.
- Model Architecture: A Vision Transformer (ViT) fine-tuned to classify images into 15 action categories: calling, clapping, cycling, dancing, drinking, eating, fighting, hugging, laughing, listening_to_music, running, sitting, sleeping, texting, and using_laptop.
- AWS Integration: Trained models are automatically uploaded to an S3 bucket using boto3, enabling easy retrieval during deployment.
- Deployment: The model can be served on an EC2 instance.
- Web Interface: Interactive inference via a FastAPI backend and Streamlit frontend (see the hypothetical client sketch below).
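For illustration only, here is a sketch of how a client could query the backend. The endpoint path, port, and payload format below are hypothetical; the actual routes are defined in `src/web_app/server.py`:

```python
# Hypothetical client sketch: the endpoint path, port, and payload format are
# illustrative assumptions; see src/web_app/server.py for the actual FastAPI routes.
import requests

with open("example.jpg", "rb") as f:
    resp = requests.post(
        "http://<public-IPv4-address>:8502/predict",  # hypothetical route
        files={"file": f},
    )
print(resp.json())  # e.g., {"label": "running", "score": 0.93}
```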
## Preprocessing Pipeline (`src/preprocessing_pipeline.py`)
The preprocessing stage is designed to be fully configurable through the JSON file `config/preprocessing_config.json`. This allows the dataset loading and splitting process to be easily adapted without modifying the code.
| Parameter | Type | Default | Description | 
|---|---|---|---|
| huggingface_dataset_name | str | "Bingsu/Human_Action_Recognition" | Name of the dataset on Hugging Face Hub. | 
| test_size | float | 0.2 | Proportion of the dataset reserved for testing. Must be strictly between 0 and 1. | 
| validation_size | float | 0.2 | Proportion of the training set reserved for validation. Must be strictly between 0 and 1. | 
| output_dir | str | "data" | Directory where the processed splits will be saved. | 
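For reference, a minimal `config/preprocessing_config.json` could look like the following (a sketch assuming a flat key layout, with the default values from the table):

```json
{
  "huggingface_dataset_name": "Bingsu/Human_Action_Recognition",
  "test_size": 0.2,
  "validation_size": 0.2,
  "output_dir": "data"
}
```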
The main steps of the data preprocessing pipeline are as follows:

- Download the dataset from Hugging Face using the name specified in the configuration.
- Train/Validation/Test Split (see the sketch after this list):
  - First split: the dataset is divided into training and testing sets (`test_size` ratio).
  - Second split: the training set is further divided into training and validation subsets (`validation_size` ratio).
  - Ensures that all subsets (train, validation, test) contain the same label classes to avoid imbalance issues.

  ⚠️ Note (August 2025): The official test set of the `Bingsu/Human_Action_Recognition` dataset is currently erroneous: it only contains a single class ("calling") even though its images come from all 15 classes. For this reason, the test set in our pipeline is re-sampled from the training split to ensure proper evaluation.

- Dataset Description: prints a statistical overview of each split (number of samples per class, total size, etc.).
- Save to Disk: stores the three subsets in the specified `output_dir` as `train/`, `val/`, and `test/`.
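As an illustration, a minimal sketch of this splitting logic with the Hugging Face `datasets` API (the stratified split call and the `labels` column name are assumptions about the dataset schema; the actual pipeline reads its parameters from `config/preprocessing_config.json`):

```python
# Minimal sketch of the split logic; values mirror the config defaults.
from datasets import load_dataset

dataset = load_dataset("Bingsu/Human_Action_Recognition", split="train")

# First split: carve the test set out of the training data
# (the official test split is erroneous, as noted above).
# stratify_by_column assumes the label column is a ClassLabel named "labels".
first = dataset.train_test_split(test_size=0.2, seed=42, stratify_by_column="labels")
train_val, test_set = first["train"], first["test"]

# Second split: carve the validation set out of the remaining training data.
second = train_val.train_test_split(test_size=0.2, seed=42, stratify_by_column="labels")
train_set, val_set = second["train"], second["test"]

# Save the three subsets under output_dir as train/, val/ and test/.
for name, split in [("train", train_set), ("val", val_set), ("test", test_set)]:
    split.save_to_disk(f"data/{name}")
```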
## Training Pipeline (`src/training_pipeline.py`)
The training process is fully configurable through the JSON config file `config/training_config.json`, which is loaded using Pydantic schemas. This allows changing the model, training parameters, and output directories without touching the core code.
| Config Section | Field | Type | Description |
|---|---|---|---|
| directories_config | input_dir | str | Path to load the preprocessed `train/` and `val/` datasets |
| | clean_train_dir_before_training | bool | Whether to clean checkpoints before training (default: True) |
| | train_dir | str | Directory to save training checkpoints |
| | training_curve_path | str | Path to save training & validation loss/accuracy curves |
| | best_model_path | str | Path to save the best model after training |
| model_params | model_name | str | Pretrained model name (e.g., google/vit-base-patch16-224) |
| | nb_layers_to_freeze | int | Number of ViT encoder layers to freeze (0–12) (default: None, i.e., train all encoder layers) |
| training_config | enable_gpu | bool | Enable GPU training if available (default: False) |
| | learning_rate | float | Learning rate for the optimizer |
| | batch_size | int | Training and validation batch size |
| | num_train_epochs | int | Number of training epochs |
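For reference, a `config/training_config.json` matching these sections might look like this (all values are illustrative, not recommendations):

```json
{
  "directories_config": {
    "input_dir": "data",
    "clean_train_dir_before_training": true,
    "train_dir": "train_checkpoints",
    "training_curve_path": "data/output/training_curves.png",
    "best_model_path": "models/best_model"
  },
  "model_params": {
    "model_name": "google/vit-base-patch16-224",
    "nb_layers_to_freeze": 6
  },
  "training_config": {
    "enable_gpu": false,
    "learning_rate": 2e-4,
    "batch_size": 32,
    "num_train_epochs": 5
  }
}
```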
The main steps of the training pipeline are as follows:

- Load Preprocessed Data:
  - Train and validation datasets are loaded from the specified `input_dir`.
  - Checks that both sets contain the same label classes.
  - Creates `label2id` and `id2label` mappings for consistent training and evaluation.
- Model Building (see the sketch after this list):
  - Loads the pretrained ViT model from Hugging Face.
  - Optionally freezes the first `nb_layers_to_freeze` encoder layers.
  - Initializes image preprocessing transforms (`RandomResizedCrop`, `ToTensor`, `Normalize`).
  - Selects the device (CPU or GPU).
- Apply Transforms: registers the transformation steps on both train and validation sets (applied on-the-fly during training).
- Training Arguments Setup:
  - Configures `TrainingArguments` (optimizer, logging, evaluation & save strategy).
  - Sets the metric for best model selection (accuracy).
- Training Loop:
  - Cleans previous checkpoints if `clean_train_dir_before_training=True`.
  - Trains the model with the Hugging Face `Trainer`.
- Metrics & Curves:
  - Tracks train loss, validation loss, and validation accuracy.
  - Plots training/validation curves and saves them to `training_curve_path`.
- Save the Pipeline:
  - Saves the best model to `best_model_path`.
  - Saves the preprocessing transforms (`transforms.pkl`) for later inference and testing.
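A condensed, illustrative sketch of these steps is shown below. Paths and hyperparameters stand in for fields of `config/training_config.json`, and the Hugging Face image processor is substituted for the pipeline's torchvision transforms for brevity:

```python
# Condensed, illustrative sketch of the core training steps; paths and
# hyperparameters stand in for config/training_config.json fields.
import numpy as np
import torch
from datasets import load_from_disk
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
)

train_ds = load_from_disk("data/train")
val_ds = load_from_disk("data/val")
labels = train_ds.features["labels"].names  # assuming a ClassLabel column named "labels"
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

model_name = "google/vit-base-patch16-224"
model = AutoModelForImageClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # re-initialize the classification head for 15 classes
)

# Optionally freeze the first nb_layers_to_freeze encoder layers.
for layer in model.vit.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False

# Register transforms applied on the fly during training.
processor = AutoImageProcessor.from_pretrained(model_name)

def transform(batch):
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["labels"]
    return inputs

train_ds.set_transform(transform)
val_ds.set_transform(transform)

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
        "labels": torch.tensor([e["labels"] for e in examples]),
    }

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

args = TrainingArguments(
    output_dir="train_checkpoints",
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    remove_unused_columns=False,  # keep the raw "image" column for set_transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("models/best_model")  # illustrative best_model_path
```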
## Testing Pipeline (`src/testing_pipeline.py`)
The testing process is fully configurable through the JSON config file `config/testing_config.json`, which is loaded using Pydantic schemas. This allows changing the model-pushing conditions without modifying the core code.
| Parameter | Type | Description | 
|---|---|---|
| input_dir | str | Path to load the preprocessed test dataset | 
| trained_model_path | str | Path to the trained model to be loaded for testing | 
| metrics_output_file | str | File path to save evaluation results (Excel) | 
| push_model_s3.enabled | bool | If true, allows pushing the model to an S3 bucket if defined conditions are met | 
| push_model_s3.conditions | list | List of metric-based conditions that must be satisfied to trigger a model push | 
| push_model_s3.conditions.metric | str | Name of the metric to check (e.g., accuracy, precision) |
| push_model_s3.conditions.threshold | float | Minimum required value for the metric to allow model upload to S3 Bucket | 
| push_model_s3.bucket_name | str | Name of the S3 bucket where the model will be uploaded | 
| push_model_s3.prefix | str | Folder or path prefix in the bucket under which the model will be stored | 
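For reference, a `config/testing_config.json` with these fields might look like this (a sketch; bucket name, paths, and thresholds are illustrative):

```json
{
  "input_dir": "data",
  "trained_model_path": "models/best_model",
  "metrics_output_file": "data/output/test_metrics.xlsx",
  "push_model_s3": {
    "enabled": true,
    "conditions": [
      { "metric": "accuracy", "threshold": 0.75 },
      { "metric": "precision", "threshold": 0.75 }
    ],
    "bucket_name": "my-pose-classifier-models",
    "prefix": "vit/best_model"
  }
}
```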
The main steps of the testing pipeline are as follows:
- Load Test Data: loads the preprocessed test dataset from the specified `input_dir`.
- Load Model and Transforms:
  - Loads the trained ViT model from `trained_model_path`.
  - Loads the preprocessing transforms (`transforms.pkl`) saved during training.
- Predictions:
  - Calculates predictions on the entire test dataset.
  - Collects both predicted and ground-truth labels.
- Metrics & Reporting:
  - Computes global metrics (accuracy, precision, recall, F1-score).
  - Generates a confusion matrix and per-class accuracy.
  - Saves all results into the Excel file defined by `metrics_output_file`. An example output file can be found in the `data/output` folder.
- Model Push to S3 (Optional, see the sketch after this list):
  - If enabled, checks whether the evaluation metrics meet the configured thresholds.
  - If all conditions are satisfied, uploads the model directory to the specified S3 bucket.
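A minimal sketch of the conditional push logic (the boto3 calls are standard; metric values, bucket name, and paths below are illustrative):

```python
# Illustrative sketch of the conditional S3 push; metric values, bucket name,
# and paths are examples mirroring the testing_config.json fields.
import os
import boto3

metrics = {"accuracy": 0.791, "precision": 0.792}  # computed on the test set
conditions = [{"metric": "accuracy", "threshold": 0.75}]
bucket_name, prefix = "my-pose-classifier-models", "vit/best_model"
model_dir = "models/best_model"

if all(metrics[c["metric"]] >= c["threshold"] for c in conditions):
    s3 = boto3.client("s3")
    # The bucket is created dynamically; outside us-east-1 this call also
    # needs a CreateBucketConfiguration with your region.
    s3.create_bucket(Bucket=bucket_name)
    for root, _, files in os.walk(model_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(local_path, model_dir)}"
            s3.upload_file(local_path, bucket_name, key)  # one object per model file
```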
## Deployment on AWS
- Create a user with the AWS IAM service and attach the following permissions:
  - AmazonEC2FullAccess
  - AmazonS3FullAccess

  (You can also create a custom policy for more restricted access if needed.)

- Generate an Access Key for this IAM user and save the following securely:
  - Access Key ID
  - Secret Access Key

- The S3 bucket will be created dynamically during the execution of `testing_pipeline.py`. If the best model found during training achieves the required score in `testing_config.json`, it will be uploaded to this bucket.

  ⚠️ Make sure the bucket name you configure is globally unique to avoid conflicts.
- Create an EC2 instance with the following specifications:
  - AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04)
  - Instance type: t3.medium
  - Key Pair: use the key pair you downloaded to connect via SSH
  - Security Group: allow inbound rules for ports 22, 80, 8501, and 8502 (all TCP)
  - Storage: 120 GiB (gp3)
- Run the EC2 instance and connect to it via SSH (or any other remote access method) to set up the environment:

  - Create a working directory:

    ```bash
    mkdir mlops
    cd mlops
    ```

  - Clone the GitHub repo:

    ```bash
    git clone https://github.com/Lahdhirim/CV-human-pose-classifier-ViT-aws.git
    cd CV-human-pose-classifier-ViT-aws
    ```

  - Install dependencies:

    ```bash
    pip install -r requirements.txt
    ```

  - Configure the AWS credentials using the AWS CLI:

    ```bash
    aws configure
    # AWS Access Key ID: ************
    # AWS Secret Access Key: ************
    # Default region name: press ENTER
    # Default output format: press ENTER
    ```

  - Add Streamlit to PATH (for command-line use):

    ```bash
    echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
    source ~/.bashrc
    ```
- (Optional but recommended) Run the application from the terminal to ensure that everything is working correctly:

  ```bash
  python3 src/web_app/server.py
  ```

  Then, in a new terminal:

  ```bash
  streamlit run src/web_app/interface.py
  ```

  If everything works, the application is accessible at http://<public IPv4 address>:8501.
- Automatically launch Streamlit on instance reboot:

  - Create the startup script:

    ```bash
    nano /home/ubuntu/start_streamlit.sh
    ```

    Paste the following:

    ```bash
    #!/bin/bash
    cd /home/ubuntu/mlops/CV-human-pose-classifier-ViT-aws
    source /home/ubuntu/.bashrc
    nohup /usr/bin/python3 src/web_app/server.py >> /home/ubuntu/fastapi.log 2>&1 &
    sleep 20
    nohup /home/ubuntu/.local/bin/streamlit run src/web_app/interface.py --server.port 8501 >> /home/ubuntu/streamlit.log 2>&1 &
    ```

  - Make the script executable:

    ```bash
    chmod +x /home/ubuntu/start_streamlit.sh
    ```

  - Add the script to crontab for reboot:

    ```bash
    crontab -e
    ```

    Add the following line at the end of the file:

    ```
    @reboot /home/ubuntu/start_streamlit.sh
    ```
Each time the instance is rebooted, Streamlit will automatically launch the web application at http://<public IPv4 address>:8501. Two log files named fastapi.log and streamlit.log will be created in the /home/ubuntu directory; these files can be used to monitor the application's status and debug any errors.
The application will be publicly accessible to anyone with the instance’s public IP address. Access can be controlled via the EC2 Security Group:
- To allow access from any IP address, set the Source to 0.0.0.0/0 on TCP port 8501.

  ⚠️ Use 0.0.0.0/0 only if you're aware of the security implications. For more restricted access, specify your own IP or a limited range.
There are four main execution modes, recommended in the following order:
```bash
python main.py preprocess_data
python main.py train
python main.py test
python main.py inference
```

This classifier is designed not only for human action recognition but also as a flexible image classification framework. Thanks to its modular pipeline and configuration-driven design:
- Configurable Pipelines: Each stage of the workflow (preprocessing, training, testing, and inference) is controlled by its own configuration file (`preprocessing_config.json`, `training_config.json`, `testing_config.json`, `inference_config.json`).
- No Code Changes Required: You can adapt the application to new datasets or classification tasks simply by updating the configuration files.
- End-to-End Workflow: From data preprocessing to model training, evaluation, and deployment, all steps are fully automated and modular.
- Rapid Deployment: The same FastAPI + Streamlit interface can serve any trained model without modification, making it suitable for a wide range of computer vision tasks.
