The following deployments were made in this Capstone assignment:

- Deployment 01 - Kubernetes - Video Link - Train a model on the vegetable-fruits and sports datasets and perform an Argo CD deployment with load testing via GitHub Actions. Workflow Result 1, Workflow Result 2
- Deployment 02 - Hugging Face - Perform a Hugging Face deployment with Gradio via GitHub Actions. Live Link
- Deployment 03 - AWS Lambda - Live Link - Perform an AWS Lambda deployment with Gradio via GitHub Actions.

Note: The video link and the two live URLs above are the Capstone assignment deliverables. (Kindly watch at 2x as it is a long video; the results are at the end.)
- Requirements
- Deliverables
- Deployments
- Learnings
- Technologies Used
- On Pull Request to the main branch:
- Trigger a model training process (using EC2 or GitHub Actions Runner).
- Compare the evaluation metrics of the freshly trained model with those of the current production model.
- On Push to the main branch:
- Retrain the model using the latest dataset.
- Store the updated model in S3 (stage environment).
- Update Kubernetes manifest files to reference the new model (stage).
- You can manipulate the YAML files using Python and push them to a new branch (see the sketch after this list).
- ArgoCD can listen to this new branch.
- Roll out the model to production via ArgoCD.
- Run a stress test to log latency and throughput.
- These can be added as a comment on the commit.
- Deploy updates to the HuggingFace Hub demo deployment
- Deploy to AWS Lambda as an additional serving endpoint (might not be required if you're pulling the model from S3).
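The YAML manipulation mentioned in the list above can be as simple as the sketch below. It assumes a KServe InferenceService manifest; the manifest path, model URI, and branch name are hypothetical:

```python
# Minimal sketch (hypothetical paths/URIs): update the storageUri of a KServe
# InferenceService manifest and push it to a branch that ArgoCD watches.
import subprocess
import yaml  # pip install pyyaml

MANIFEST = "fastapi-helm/templates/sports-classifier.yaml"          # assumed path
NEW_MODEL_URI = "s3://mybucket-emlo-mumbai/sports-stage/model-store/"  # assumed URI

with open(MANIFEST) as f:
    doc = yaml.safe_load(f)

# KServe InferenceServices keep the model location under spec.predictor.model.storageUri.
doc["spec"]["predictor"]["model"]["storageUri"] = NEW_MODEL_URI

with open(MANIFEST, "w") as f:
    yaml.safe_dump(doc, f, sort_keys=False)

# Push the change to a dedicated branch so ArgoCD can sync from it.
for cmd in (
    ["git", "checkout", "-B", "model-update"],
    ["git", "add", MANIFEST],
    ["git", "commit", "-m", "chore: point manifest at the new model"],
    ["git", "push", "-f", "origin", "model-update"],
):
    subprocess.run(cmd, check=True)
```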
To successfully complete the capstone project, you must submit:
- A comprehensive architecture diagram illustrating the entire pipeline and deployment process.
- A demonstration video that walks through the end-to-end pipeline, showing each step and its impact.
- The complete code repository containing GitHub Actions configuration files, Kubernetes manifest files, and Helm charts.
- An exhaustive README that details:
- The project's structure.
- The process of data management, model training, and deployment.
- Screenshots and explanations of your pipeline in action.
- Performance metrics, including latency and stress test results.
Explanation: Deployment 01
Kubernetes deployment with GitHub Actions: the workflow shows the ECR repos and all three Dockerfiles, and explains the secrets setup. On push to the main branch it trains the model and pushes it to the stage location, sets up KServe, Knative, and ArgoCD, performs a load test, and after a successful load test promotes the model from stage to prod, commenting the load test results on the commit.
Note: This repo also has the procedure for ArgoCD deployment repo https://github.com/ajithvcoder/emlo4-session-18-ajithvcoder-canary-argocd-kserve
Download Dataset
chmod +x shscripts/download_zips.sh && ./shscripts/download_zips.sh
Sports
data/processed/sports/
├── sports.csv
├── test
│   ├── air hockey
│   ├── ...
│   └── wingsuit flying
├── train
│   ├── air hockey
│   ├── ...
│   └── wingsuit flying
└── valid
    ├── air hockey
    ├── ...
    └── wingsuit flying
-----------------------------------------------------
data/processed/vegfruits/
Training a model
We have used the timm models convnext_tiny and edgenext_xx_small for both classifiers. An output_{project}.txt file is generated after every training run, which gives us the model's accuracy without having to re-evaluate it.
Hparams Search
- Optuna is used for hyperparameter search. Hydra and PyTorch Lightning are used for flexibility.
- You can refer to hsports.yaml and hvegfruits.yaml for the final experimentation.
TorchServe uses the ONNX model for deployment, so the following are required:
- config.properties
- handler file
- model
- class.json file
Torchserve - Preparation file
torch-model-archiver --model-name {project}-classifier \
  --serialized-file {deploy_dir}/{project}.onnx \
  --handler src/backend/torchserve_app/{project}_handler.py \
  --export-path {deploy_dir}/model-store/ -f --version 0.0.1 \
  --extra-files {deploy_dir}/index_to_name.json
Basic train
python src/backend/torch_local/train.py experiment=hvegfruits script=true
python src/backend/torch_local/train.py experiment=hsports script=true
The script takes care of creating the ONNX model in the checkpoints folder (see the export sketch below).
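For reference, the ONNX export the script performs boils down to something like this sketch; the backbone, class count, input resolution, and output path are assumptions:

```python
# Minimal sketch: create one of the timm backbones used here and export it to ONNX.
import timm
import torch

model = timm.create_model("convnext_tiny", pretrained=False, num_classes=100)  # assumed class count
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # assumed input resolution
torch.onnx.export(
    model,
    dummy,
    "checkpoints/vegfruits.onnx",  # hypothetical output path
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```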
Eval
python src/backend/torch_local/eval.py experiment=evegfruits
python src/backend/torch_local/eval.py experiment=esports
Take the model and host it with FastAPI:
- Host one API on port 8080 and another on port 9090
- python src/backend/fastapi_app/fapi_vegfruits.py
- python src/backend/fastapi_app/fapi_sports.py
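Each of these apps is roughly the sketch below; the model path, input tensor name, preprocessing, and the /predict route are assumptions for illustration:

```python
# Minimal FastAPI sketch: accept an image upload and return the predicted class index.
import io

import numpy as np
import onnxruntime as ort
import uvicorn
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI()
session = ort.InferenceSession("checkpoints/vegfruits.onnx")  # hypothetical model path

@app.post("/predict")
async def predict(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB").resize((224, 224))
    x = np.asarray(image, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0
    logits = session.run(None, {"input": x})[0]
    return {"class_index": int(logits.argmax())}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9090)
```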
Move files to s3
python src/backend/torch_local/transfer_mar.py
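transfer_mar.py essentially uploads the deployment artifacts to S3 with boto3; a minimal sketch of that idea follows (bucket, prefix, and file list are assumptions):

```python
# Minimal sketch: upload the deployment artifacts to the target S3 prefix.
from pathlib import Path

import boto3  # uses the credentials set via `aws configure`

BUCKET = "mybucket-emlo-mumbai"   # assumed bucket
PREFIX = "vegfruits-dev"          # dev / stage / prod depending on the use case
FILES = [                         # assumed artifacts
    "checkpoints/vegfruits.onnx",
    "checkpoints/index_to_name.json",
    "checkpoints/output_vegfruits.txt",
]

s3 = boto3.client("s3")
for path in FILES:
    key = f"{PREFIX}/{Path(path).name}"
    s3.upload_file(path, BUCKET, key)
    print(f"uploaded {path} -> s3://{BUCKET}/{key}")
```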
Next JS
There are two selection buttons, and each one redirects to a different endpoint.
- Endpoints can be changed in src/frontend/ui/app/predict_app1 and src/frontend/ui/app/predict_app2
- npm run dev
Run aws configure and set the credentials.
Make sure only data/processed/sports and data/processed/vegfruits are present, and remove the other zip files.
dvc remote add -d myremote s3://mybucket-emlo-mumbai/session-18-data
dvc add data
Set the S3 URL and push
dvc push -r myremote
Training with GPU
- Build an image for training. The Dockerfile is in the root folder.

docker build -t emlo-18-train -f Dockerfile .

- Train a model inside Docker and push it to S3:

docker run --gpus=all \
  --name session-18-container \
  --shm-size=8g \
  -v "$(pwd):/workspace" \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION \
  -e AWS_REGION=$AWS_REGION \
  emlo-18-train \
  /bin/bash -c "dvc pull -r myremote && \
    dvc repro -f && \
    python src/backend/torch_local/transfer_mar.py -p dev"

- dvc pull -r myremote - pulls from S3 to local
- dvc repro -f - runs train and test
- python src/backend/torch_local/transfer_mar.py -p dev - transfers the required files (ONNX model, pt model, classes.json, and the accuracy information) to the S3 dev location or the given path in S3.

<debug>
Helpful for debugging containers locally:

docker run -it --gpus=all \
  --name session-18-container \
  --shm-size=8g \
  -v "$(pwd):/workspace" \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION=$AWS_REGION \
  -e AWS_REGION=$AWS_REGION \
  emlo-18-train \
  /bin/bash
</debug>
Wait patiently and check that every deletion is successful on the AWS CloudFormation stacks page before you shut down, because the deletion sometimes fails; something could still be running in the background and it may cost you a lot.
If you are triggering a spot instance manually with the persistent type, ensure that the spot request is cancelled manually and that the AWS instance is finally terminated.
Note: This repo also has the procedure for ArgoCD deployment repo https://github.com/ajithvcoder/emlo4-session-18-ajithvcoder-canary-argocd-kserve
It took our group two months of weekend effort to complete this, and it may cost you at least $20 in AWS charges if you are debugging and developing on AWS for this deployment alone. If you are just going to test it, it may cost $0.05 to $1 with GitHub Actions.
AWS install
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Provide credentials
aws configure
EKSCTL Install
# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
# (Optional) Verify checksum
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
Set the default ssh key locally
This default SSH key is used by AWS for the default SSH login.
ssh-keygen -t rsa -b 4096
Install kubectl for AWS EKS locally
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.32.0/2024-12-20/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
ArgoCD
curl -sSL -o argocd-linux-amd64 https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
sudo install -m 555 argocd-linux-amd64 /usr/local/bin/argocd
# rm argocd-linux-amd64
All the Docker images below are present here.
- Build and push docker images to AWS ECR
Model server
Name: model-onnx-server
- Used pytorch/torchserve-kfs:0.12.0 and installed the ONNX dependencies, so that KServe and Kubernetes inference services can also serve ONNX models while staying lightweight.
Web server
Name: web-server
docker build -t web-server -f Dockerfile.web-server .
UI server
Name: ui-server
docker build -t ui-server -f Dockerfile.ui-server .
Test all the above images locally on ports like 8080 or 9090 first and then change to port 80; otherwise it is hard to debug locally.
Use the commands below to debug the web server and the UI server:
- docker run -it --network my_network -v ./emlo4-s18/E2EMLOps/K8SDeploy/src/web-server:/opt/src -p9090:9090 web-server bash
- uvicorn server:app --host 0.0.0.0 --port 9090
ECR Repo
Cluster creation
Note: Comment out publicKeyPath: /root/.ssh/id_rsa.pub in the cluster file if you are running this locally, since it will then pick up the default SSH key from your machine.
Metric server config fix
- kubectl delete -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml --validate=false
- kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type='json' -p='[ { "op": "add", "path": "/spec/template/spec/hostNetwork", "value": true }, { "op": "replace", "path": "/spec/template/spec/containers/0/args", "value": [ "--cert-dir=/tmp", "--secure-port=4443", "--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname", "--kubelet-use-node-status-port", "--metric-resolution=15s", "--kubelet-insecure-tls" ] }, { "op": "replace", "path": "/spec/template/spec/containers/0/ports/0/containerPort", "value": 4443 } ]'
Reference:
KNative
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-crds.yaml
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-core.yaml
ISTIO
- kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.16.0/istio.yaml
- kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.16.0/net-istio.yaml
Patching
kubectl patch configmap/config-domain \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"emlo.tsai":""}}'
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-hpa.yaml
- kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml
Wait and check if everything is up and running
kubectl get all -n cert-manager
KServe
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.14.1/kserve.yaml
Wait and check if all pods are up
- kubectl get all -n kserve
- kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.14.1/kserve-cluster-resources.yaml
S3 Access creation
eksctl create iamserviceaccount --cluster=basic-cluster-1 --name=s3-read-only --attach-policy-arn=arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess --override-existing-serviceaccounts --region ap-south-1 --approve
- kubectl apply -f s3-secret.yaml
- kubectl patch serviceaccount s3-read-only -p '{"secrets": [{"name": "s3-secret"}]}'
Dashboard, ALB, Charts
Kubernetes Dashboard
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard
kubectl label namespace default istio-injection=enabled
ALB
eksctl create iamserviceaccount \
--cluster=basic-cluster-1 \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--attach-policy-arn=arn:aws:iam::306093656765:policy/AWSLoadBalancerControllerIAMPolicy \
--override-existing-serviceaccounts \
--region ap-south-1 \
--approve
EKS Charts
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=basic-cluster-1 --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller
Test URL
kubectl get pods,svc -n istio-system
The URL shown by the above command can also be seen on the Load Balancers page in AWS.
Test if sports-classifier works fine with the sports S3 models:

kubectl apply -f sports-classifier.yaml

python test_kserve_sports_2.py

Check that everything works and then delete it; we still need to set up Prometheus and Grafana.
Ingress details
kubectl get isvc
kubectl get svc -n istio-system
Delete classifier after testing
kubectl delete -f sports-classifier.yaml
Up to this point you can check whether everything works with python test_kserve_sports_2.py (a minimal sketch of such a request is shown below).
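test_kserve_sports_2.py essentially sends a prediction request through the Istio ingress with the InferenceService host header. The ingress URL, service host, model name, and payload format below are assumptions for illustration:

```python
# Minimal sketch: call the KServe InferenceService through the Istio ingress.
import base64

import requests

INGRESS_URL = "http://<istio-ingress-loadbalancer>"       # from `kubectl get svc -n istio-system`
SERVICE_HOST = "sports-classifier.default.emlo.tsai"      # from `kubectl get isvc` (assumed)
MODEL_NAME = "sports-classifier"                          # assumed model name

with open("sample_sports.jpg", "rb") as f:
    payload = {"instances": [{"data": base64.b64encode(f.read()).decode()}]}

resp = requests.post(
    f"{INGRESS_URL}/v1/models/{MODEL_NAME}:predict",
    json=payload,
    headers={"Host": SERVICE_HOST},
    timeout=30,
)
print(resp.status_code, resp.json())
```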
Prometheus
cd other-setup
git clone --branch release-0.14 https://github.com/kserve/kserve.git
cd kserve
kubectl apply -k docs/samples/metrics-and-monitoring/prometheus-operator
kubectl wait --for condition=established --timeout=120s crd/prometheuses.monitoring.coreos.com
kubectl wait --for condition=established --timeout=120s crd/servicemonitors.monitoring.coreos.com
kubectl apply -k docs/samples/metrics-and-monitoring/prometheus
cd ..
kubectl patch configmaps -n knative-serving config-deployment --patch-file qpext_image_patch.yaml
Optional step: set the maximum number of nodes, because if you send more requests and no maximum is set, the cluster may scale out further than intended.
<debug>
eksctl scale nodegroup --cluster=basic-cluster --nodes=6 ng-spot-3 --nodes-max=6
eksctl get nodegroup --cluster basic-cluster --region ap-south-1 --name ng-spot-3
</debug>
kubectl port-forward service/prometheus-operated -n kfserving-monitoring 9090:9090
Grafana
kubectl create namespace grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana --namespace grafana --version 8.8.4
kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
kubectl port-forward svc/grafana 3000:80 -n grafana
Setup dashboard for grafana
- Go to Connections -> Add data source -> Prometheus -> add this Prometheus URL: http://prometheus-operated.kfserving-monitoring.svc.cluster.local:9090 -> Save & test
- Go to Dashboards -> New -> Import -> download the JSON file from https://grafana.com/grafana/dashboards/18032-knative-serving-revision-http-requests/ -> upload the JSON in the specified place
verify
kubectl get isvc
Now clone this repo and get inside it
You can also perform a canary deployment with this same setup by adding `canary: 30` in the canary-argocd-kserve repo.
ArgoCD setup
- kubectl create namespace argocd
- kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
Get Argocd password for login
argocd admin initial-password -n argocd
Check if you are able to access the UI. In Codespaces I was not able to port-forward and access the ArgoCD UI, but locally it works.
kubectl port-forward svc/argocd-server -n argocd 8080:443
Argo CD deployment
Note: I have used the canary-argocd-kserve repo for deploying the Argo CD apps. Please refer to https://github.com/ajithvcoder/emlo4-session-18-ajithvcoder-canary-argocd-kserve for the Argo CD repo structure.
- Have an s3-secret.yaml file at argo-apps/s3-secret.yaml and update it with your AWS credentials, i.e. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
apiVersion: v1
kind: Secret
metadata:
  name: s3creds
  annotations:
    serving.kserve.io/s3-endpoint: s3.ap-south-1.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
    serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
    serving.kserve.io/s3-region: "ap-south-1"
    serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
type: Opaque
stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
  AWS_ACCESS_KEY_ID: AKXXXXXXXXXXXXXXXXXXXXX
  AWS_SECRET_ACCESS_KEY: "RQHBUNBSJNINQONUKNUKXXXXXX+XQIWOW"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-read-only
secrets:
- name: s3creds
- Create the Argo CD repo before you start and update the repo URL in the argo-apps/models.yaml file.
Now commit the load balancer URL to the emlo4-s18/emlo4-session-18-ajithvcoder-canary-argocd-kserve/fastapi-helm/templates/model-server.cm.yml file.
Change both the model-1 and model-2 URLs.
Deploy Command
kubectl apply -f argo-apps
<debug>
Delete argocd deployments
kubectl get app -n argocd
kubectl patch app fastapi-helm -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd
kubectl delete app fastapi-helm -n argocd
</debug>
Verify the pods
kubectl get pods
Load testing Veg fruits model
python3 utils/test_load_vegfruits_2.py
Load testing Sports model
python3 utils/test_load_sports_2.py
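These load-test scripts boil down to firing many concurrent requests and recording the latency and success rate. A minimal sketch of that idea follows; the URL, host header, payload, and request counts are assumptions:

```python
# Minimal load-test sketch: concurrent requests with latency and success-rate stats.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://<istio-ingress-loadbalancer>/v1/models/vegfruits-classifier:predict"  # assumed
HEADERS = {"Host": "vegfruits-classifier.default.emlo.tsai"}                        # assumed
PAYLOAD = {"instances": [{"data": "<base64-encoded image>"}]}                       # assumed
N_REQUESTS, CONCURRENCY = 200, 20

def call(_):
    start = time.perf_counter()
    ok = requests.post(URL, json=PAYLOAD, headers=HEADERS, timeout=30).status_code == 200
    return ok, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(call, range(N_REQUESTS)))

latencies = [t for _, t in results]
success_rate = sum(ok for ok, _ in results) / N_REQUESTS
print(f"success rate: {success_rate:.1%}")
print(f"p50 latency: {statistics.median(latencies):.3f}s, max: {max(latencies):.3f}s")
```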
Delete argocd deployments
Verify app name
kubectl get app -n argocd
Delete cascade
kubectl patch app model-deployments -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd
kubectl delete app model-deployments -n argocd
Deletion of cluster
eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction
Wait patiently and check that every deletion is successful on the AWS CloudFormation stacks page before you shut down, because the deletion sometimes fails; something could still be running in the background and it may cost you a lot.
If you are triggering a spot instance manually with the persistent type, ensure that the spot request is cancelled manually and that the AWS instance is finally terminated.
Ports establishment
Argo CD UI
Pods getting initialized
Load balancers after deployment
- For Model serving
- For UI serving
Change ALB URLs
*Note: In the GitHub Actions deployment this is done automatically by GitHub bots.
Argo CD Commit
Pod running
UI Prediction
Post request
Grafana Dashboards
Load test - Scale up and down
- The "requests per second" metric limit is 5, and the pod scales up when it goes above 5.
- Min and max pod scaling are 1 and 3.
Load request to model 1
Scale up
Scale down
Load request to model 2
Response time - Load test
Sports Model
Vegfruits Model
Two Workflows

- On push to the main branch, the Main Deployment workflow gets triggered, which calls 01A_Deployment_Train_And_Store.yaml and 01B_Deployment_ArgoCD_And_Promote.yaml.

01A Deployment_Train_And_Store
a. Internally builds the current code with its requirements as an image and pushes it to the a18/torch-lightning-train repo in ECR.
b. Triggers a spot g4dn.xlarge (16 GB) GPU instance with a 128 GB volume and the PyTorch 2.6 AMI.
c. Pulls the image from the ECR repo, downloads the dataset, trains, and pushes the model to the s3-stage location.

01B Deployment_ArgoCD_And_Promote
a. Triggers an EC2 t3a.xlarge as the host instance.
b. Sets up a cluster with 3 spot and 1 dedicated instance.
c. Performs all the necessary setup for KServe, ALB, Argo CD, Prometheus, and Grafana. KServe takes care of HPA scaling as it has it built in.
d. Fetches the model from s3-stage and deploys it.
e. Performs a load test on both models, and if the success rate is more than 90% it moves the model to s3-prod (see the promotion sketch after this list).

- On pull request to the main branch, the PR Accuracy Test workflow gets triggered, which calls 01D_Deployment_PR_Train_And_Store.yaml.

a. Triggers a spot g4dn.xlarge (16 GB) GPU instance with a 128 GB volume and the PyTorch 2.6 AMI.
b. Pulls the image from the ECR repo, downloads the dataset, trains, and pushes the model to the s3-dev location.
c. Checks whether the model achieves a higher accuracy than s3-prod; if so, makes a comment.
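Promotion from s3-stage to s3-prod is essentially a server-side copy of the model artifacts. A minimal sketch follows; the bucket and prefixes are assumptions:

```python
# Minimal promotion sketch: copy every object under the stage prefix to the prod prefix.
import boto3

BUCKET = "mybucket-emlo-mumbai"   # assumed bucket
STAGE_PREFIX = "sports-stage/"    # assumed prefixes
PROD_PREFIX = "sports-prod/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=STAGE_PREFIX):
    for obj in page.get("Contents", []):
        dest_key = PROD_PREFIX + obj["Key"][len(STAGE_PREFIX):]
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            Key=dest_key,
        )
        print(f"promoted {obj['Key']} -> {dest_key}")
```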
There are 3 repos
- a18/ui-server - Next JS UI
- a18/web-server - Fast API web server
- model-onnx-server - For serving Kserve + HPA model
*Note: After training, for deployment 1 we save to {project_name}-dev, {project_name}-stage, or {project_name}-prod in the S3 bucket depending on the use case. For deployment 2 and deployment 3 we use {project_name}-prod as the model source for deployment.
Main Deployment workflow

- Triggered on push to the main branch
- Trains the model and pushes it to the S3 stage folder
- Sets up Argo CD and deploys the two models with HPA
- Fetches the load balancer URL and commits it to the Argo CD repo

Live Model

- Deploys the Argo CD app
- After a successful load test, promotes the model from s3-stage to s3-prod - (screenshot)

Model 1
Model 2

- Comments the load test results on the tested commit - (screenshot)
PR Accuracy Test workflow

When a new PR is made, this workflow starts to execute. It trains and then pushes the model to S3.

- Overall actions workflow
- Builds the current code and pushes it to the ECR repo
- Starts a GPU spot instance
- Pulls the dataset, trains a model, and compares its accuracy with the prod model. An output_{project_name}.txt is generated after every training run and travels with the model wherever it is copied, so it can be used for the comparison instead of re-evaluating every time (see the comparison sketch after this list).
- Stops the EC2 runner
- On the pull request, generates a comment
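The accuracy comparison against the prod model can then be done just by reading the two output_{project_name}.txt files. The sketch below is an illustration; the bucket, object key, and file format (a "test_acc: ..." line) are assumptions:

```python
# Minimal comparison sketch: read the accuracy recorded in output_{project}.txt
# for the freshly trained model and for the prod model, then decide.
import re

import boto3

BUCKET = "mybucket-emlo-mumbai"               # assumed bucket
PROD_KEY = "sports-prod/output_sports.txt"    # assumed key

def parse_accuracy(text: str) -> float:
    # Assumes the file contains a line like "test_acc: 0.93".
    return float(re.search(r"test_acc[:=]\s*([0-9.]+)", text).group(1))

with open("output_sports.txt") as f:
    new_acc = parse_accuracy(f.read())

prod_obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=PROD_KEY)
prod_acc = parse_accuracy(prod_obj["Body"].read().decode())

print(f"new={new_acc:.4f} prod={prod_acc:.4f}")
if new_acc > prod_acc:
    print("New model beats prod - comment on the PR / promote.")
```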
(Kindly watch at 2x as it is a long video. The results are at the end of the video.)
*Note: After training, for deployment 1 we save to {project_name}-dev, {project_name}-stage, or {project_name}-prod in the S3 bucket depending on the use case. For deployment 2 and deployment 3 we use {project_name}-prod as the model source for deployment.
Files are in the gradio_deploy/ folder.

- S3 folder where the model is present
- Develop a Gradio app
- Create a Space on Hugging Face
- Create a write token, since we need to push the model and code to Hugging Face through GitHub Actions
- Add the AWS secrets to the Hugging Face Space

Set the Hugging Face credentials in the environment.
Now push the files to the Hugging Face Space with the huggingface_hub package.
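The push via huggingface_hub looks roughly like this sketch; the Space id, local folder, and HF_TOKEN environment variable are assumptions:

```python
# Minimal sketch: push the Gradio app files to the Hugging Face Space.
import os

from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])  # write token created earlier (assumed env var)
api.upload_folder(
    folder_path="gradio_deploy",         # local folder with app.py, requirements.txt, ...
    repo_id="your-username/your-space",  # hypothetical Space id
    repo_type="space",
)
```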
Local testing
python app.py
app.py is a Gradio app that fetches the model files from S3 and uses them for prediction (a minimal sketch is shown below).
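A minimal sketch of such an app follows; the bucket, object keys, class-mapping format, and preprocessing are assumptions:

```python
# Minimal Gradio sketch: download the ONNX model from S3 and serve predictions.
import json
import os

import boto3
import gradio as gr
import numpy as np
import onnxruntime as ort
from PIL import Image

BUCKET = "mybucket-emlo-mumbai"  # assumed bucket
os.makedirs("s3_files", exist_ok=True)
s3 = boto3.client("s3")
s3.download_file(BUCKET, "sports-prod/sports.onnx", "s3_files/sports.onnx")                    # assumed keys
s3.download_file(BUCKET, "sports-prod/index_to_name.json", "s3_files/index_to_name.json")

session = ort.InferenceSession("s3_files/sports.onnx")
with open("s3_files/index_to_name.json") as f:
    idx_to_name = json.load(f)  # assumed mapping: {"0": "air hockey", ...}

def predict(image: Image.Image) -> str:
    x = np.asarray(image.convert("RGB").resize((224, 224)), dtype=np.float32)
    x = x.transpose(2, 0, 1)[None] / 255.0
    logits = session.run(None, {"input": x})[0][0]
    return idx_to_name[str(int(logits.argmax()))]

gr.Interface(fn=predict, inputs=gr.Image(type="pil"), outputs="label").launch()
```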
Hugging face space repo
Hugging face Repo Deployments
Github actions deployment
Live Deployed D02 Hugging Face Link
Note: Kindly check AWS CloudFormation to verify that everything is deleted.
*Note: After training, for deployment 1 we save to {project_name}-dev, {project_name}-stage, or {project_name}-prod in the S3 bucket depending on the use case. For deployment 2 and deployment 3 we use {project_name}-prod as the model source for deployment.
Explanation: First, a local deployment of Gradio is done for the sports classifier and the veg-fruits classifier and tested with Docker. Production-grade models are fetched from S3 and kept in the "s3_files" folder for both deployments. They are then added to the Docker container, and the CDK package is used to push the image to ECR and create the Lambda service stack.
Files are in the LambdaFn/ folder.
NPM and Node
sudo apt install npm
sudo npm install -g aws-cdk
cdk --version
CDK using Python
Note: pwd = ParentDirectory
source .venv/bin/activate
python -m pip install -r aws-req.txt
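The CDK app that creates the Lambda stack is roughly the sketch below. The construct names, memory size, timeout, and public function URL are assumptions; it builds the image from the Dockerfile.lambdafn used in the Build step:

```python
# Minimal CDK sketch: package the Gradio app as a Docker-image Lambda with a public URL.
from aws_cdk import App, Duration, Stack, aws_lambda as _lambda
from constructs import Construct

class GradioLambdaStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Lambda packaged as the Docker image built from Dockerfile.lambdafn.
        fn = _lambda.DockerImageFunction(
            self,
            "GradioFn",
            code=_lambda.DockerImageCode.from_image_asset(".", file="Dockerfile.lambdafn"),
            memory_size=3008,               # assumed memory
            timeout=Duration.seconds(300),  # assumed timeout
        )
        # Expose a public URL for the demo (assumption: no auth needed).
        fn.add_function_url(auth_type=_lambda.FunctionUrlAuthType.NONE)

app = App()
GradioLambdaStack(app, "GradioLambdaStack")
app.synth()
```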
Local test for gradio
python app.py
Deploy in AWS

- cdk bootstrap aws://${CDK_DEFAULT_ACCOUNT}/${CDK_DEFAULT_REGION}
- cdk deploy --require-approval=never
- cdk destroy - destroys all the resources created
Kindly check AWS CloudFormation to verify that everything is deleted.
Build

- docker build -t lambdafn -f Dockerfile.lambdafn . --no-cache
- docker run -v /workspaces/E2EMLOps/LambdaFn:/var/task -it -p 8080:8080 lambdafn bash
<debug>
1. Use docker builder prune --all --force to clean the Docker builds locally.
2. Don't use COPY . . — the cdk.out folder also stores the Docker images and can grow to around 4 GB+, so copy only the required files.
</debug>
Github actions deployment
- Files are downloaded from the s3-prod folder and added to the Docker container; the CDK package is used to push the image to ECR and create the Lambda service.
- Sports model prediction
- Vegetable-fruits model prediction
- Learnt about making automated deployments with GitHub Actions.
- If we build a GPU image from the GitHub runner it throws the error below, so I added new privileges to use the host GPU.

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")

- I learnt a lot about GitHub Actions and Kubernetes services over the past few assignments, and I learnt that it is good to spend some money on AWS services, similar to how we pay fees for courses. We pay 20k or 50k for some courses but hesitate to spend 500 or 1000 rupees on practical work; doing so has helped me. I spent 10k on all 18 assignments, and it has helped me handle Kubernetes issues at my office.
*Most of the other pieces were already done in previous assignments; this assignment is a combination of everything.
PyTorch Lightning - a lightweight PyTorch wrapper for high-performance AI research. Think of it as a framework for organizing your PyTorch code.
Hydra - a framework for elegantly configuring complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.
DVC - A tool designed to handle large datasets and machine learning models in a version-controlled workflow
Tensorboard|wandb - TensorBoard is a tool that provides visualization and debugging capabilities for TensorFlow and PyTorch experiments. It’s a popular choice for monitoring machine learning training processes in real time.
AWS|EC2|S3|Lambda|ECR - AWS Elastic Compute Cloud (EC2) is a service that provides scalable virtual computing resources in the cloud.
Docker - A platform for creating, deploying, and managing lightweight, portable, and scalable containers.
FastAPI|Gradio - FastAPI is a modern Python framework for building APIs; Gradio is a Python library for building simple, interactive web interfaces for machine learning models.
Nextjs - Frontend FrameWork
K8s|KNative|Kserve|Istio|ArgoCD - Kubernetes on AWS (EKS) with Knative, KServe, and Istio for model serving, and ArgoCD for GitOps-based deployment.
Prometheus|Grafana - observability
Group Members
- Ajith Kumar V (myself)
- Hema M
- Muthukamalan
- Abhiya Gupta