The following deployments were made in this Capstone assignment:

- Deployment 01 - Kubernetes - Video Link - Train a model on the vegetable-fruits and sports datasets and perform an Argo CD deployment with load testing via GitHub Actions. Workflow Result 1, Workflow Result 2
- Deployment 02 - Hugging Face - Perform a Hugging Face deployment with Gradio via GitHub Actions. Live Link
- Deployment 03 - AWS Lambda - Live Link - Perform an AWS Lambda deployment with Gradio via GitHub Actions.

Note: The video link and the two live URLs above are the Capstone assignment deliverables. (Kindly watch at 2x as it is a long video; the results are at the end.)
- Requirements
- Deliverables
- Deployments
- Learnings
- Technologies Used
- On Pull Request to the main branch:
- Trigger a model training process (using EC2 or GitHub Actions Runner).
- Compare the evaluation metrics of the freshly trained model with those of the current production model.
- On Push to the main branch:
- Retrain the model using the latest dataset.
- Store the updated model in S3 (stage environment).
- Update Kubernetes manifest files to reference the new model (stage).
- You can manipulate the YAML files using Python and push them to a new branch (see the sketch after this list).
- ArgoCD can listen to this new branch.
- Roll out the model to production via ArgoCD.
- Run a stress test to log latency and throughput.
- These can be added as a comment on the commit.
- Deploy updates to the HuggingFace Hub demo deployment
- Deploy to AWS Lambda as an additional serving endpoint (might not be required if you're pulling the model from S3).
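The YAML manipulation mentioned in the list above can be as simple as the sketch below. It assumes a KServe InferenceService manifest; the manifest path, model URI, and branch name are hypothetical:

```python
# Minimal sketch (hypothetical paths/URIs): update the storageUri of a KServe
# InferenceService manifest and push it to a branch that ArgoCD watches.
import subprocess
import yaml  # pip install pyyaml

MANIFEST = "fastapi-helm/templates/sports-classifier.yaml"          # assumed path
NEW_MODEL_URI = "s3://mybucket-emlo-mumbai/sports-stage/model-store/"  # assumed URI

with open(MANIFEST) as f:
    doc = yaml.safe_load(f)

# KServe InferenceServices keep the model location under spec.predictor.model.storageUri.
doc["spec"]["predictor"]["model"]["storageUri"] = NEW_MODEL_URI

with open(MANIFEST, "w") as f:
    yaml.safe_dump(doc, f, sort_keys=False)

# Push the change to a dedicated branch so ArgoCD can sync from it.
for cmd in (
    ["git", "checkout", "-B", "model-update"],
    ["git", "add", MANIFEST],
    ["git", "commit", "-m", "chore: point manifest at the new model"],
    ["git", "push", "-f", "origin", "model-update"],
):
    subprocess.run(cmd, check=True)
```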
To successfully complete the capstone project, you must submit:
- A comprehensive architecture diagram illustrating the entire pipeline and deployment process.
- A demonstration video that walks through the end-to-end pipeline, showing each step and its impact.
- The complete code repository containing GitHub Actions configuration files, Kubernetes manifest files, and Helm charts.
- An exhaustive README that details:
- The project's structure.
- The process of data management, model training, and deployment.
- Screenshots and explanations of your pipeline in action.
- Performance metrics, including latency and stress test results.
Explanation: Deployment 01
Kubernetes deployment with GitHub Actions: the workflow shows the ECR repos and all three Dockerfiles, and explains the secrets setup. On push to the main branch it trains the model and pushes it to the stage location, sets up KServe, Knative, and ArgoCD, performs a load test, and after a successful load test promotes the model from stage to prod, commenting the load test results on the commit.
Note: This repo also has the procedure for ArgoCD deployment repo https://github.com/ajithvcoder/emlo4-session-18-ajithvcoder-canary-argocd-kserve
Download Dataset
chmod +x shscripts/download_zips.sh && ./shscripts/download_zips.sh
Sports
data/processed/sports/
├── sports.csv
├── test
│   ├── air hockey
│   ├── ...
│   └── wingsuit flying
├── train
│   ├── air hockey
│   ├── ...
│   └── wingsuit flying
└── valid
    ├── air hockey
    ├── ...
    └── wingsuit flying
-----------------------------------------------------
data/processed/vegfruits/
Training a model
We have used the timm models convnext_tiny and edgenext_xx_small for both classifiers. An output_{project}.txt file is generated after every training run, which gives us the model's accuracy without having to re-evaluate it.
Hparams Search
- Optuna is used for hyperparameter search. Hydra and PyTorch Lightning are used for flexibility.
- You can refer to hsports.yaml and hvegfruits.yaml for the final experimentation.
TorchServe uses the ONNX model for deployment, so the following are required:
- config.properties
- handler file
- model
- class.json file
Torchserve - Preparation file
torch-model-archiver --model-name {project}-classifier \
  --serialized-file {deploy_dir}/{project}.onnx \
  --handler src/backend/torchserve_app/{project}_handler.py \
  --export-path {deploy_dir}/model-store/ -f --version 0.0.1 \
  --extra-files {deploy_dir}/index_to_name.json
Basic train
python src/backend/torch_local/train.py experiment=hvegfruits script=true
python src/backend/torch_local/train.py experiment=hsports script=true
The script takes care of creating the ONNX model in the checkpoints folder (see the export sketch below).
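For reference, the ONNX export the script performs boils down to something like this sketch; the backbone, class count, input resolution, and output path are assumptions:

```python
# Minimal sketch: create one of the timm backbones used here and export it to ONNX.
import timm
import torch

model = timm.create_model("convnext_tiny", pretrained=False, num_classes=100)  # assumed class count
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # assumed input resolution
torch.onnx.export(
    model,
    dummy,
    "checkpoints/vegfruits.onnx",  # hypothetical output path
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```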
Eval
python src/backend/torch_local/eval.py experiment=evegfruits
python src/backend/torch_local/eval.py experiment=esports
Take the model and host it with FastAPI:
- Host one API on port 8080 and another on port 9090
- python src/backend/fastapi_app/fapi_vegfruits.py
- python src/backend/fastapi_app/fapi_sports.py
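Each of these apps is roughly the sketch below; the model path, input tensor name, preprocessing, and the /predict route are assumptions for illustration:

```python
# Minimal FastAPI sketch: accept an image upload and return the predicted class index.
import io

import numpy as np
import onnxruntime as ort
import uvicorn
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI()
session = ort.InferenceSession("checkpoints/vegfruits.onnx")  # hypothetical model path

@app.post("/predict")
async def predict(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB").resize((224, 224))
    x = np.asarray(image, dtype=np.float32).transpose(2, 0, 1)[None] / 255.0
    logits = session.run(None, {"input": x})[0]
    return {"class_index": int(logits.argmax())}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9090)
```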
Move files to s3
python src/backend/torch_local/transfer_mar.py
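transfer_mar.py essentially uploads the deployment artifacts to S3 with boto3; a minimal sketch of that idea follows (bucket, prefix, and file list are assumptions):

```python
# Minimal sketch: upload the deployment artifacts to the target S3 prefix.
from pathlib import Path

import boto3  # uses the credentials set via `aws configure`

BUCKET = "mybucket-emlo-mumbai"   # assumed bucket
PREFIX = "vegfruits-dev"          # dev / stage / prod depending on the use case
FILES = [                         # assumed artifacts
    "checkpoints/vegfruits.onnx",
    "checkpoints/index_to_name.json",
    "checkpoints/output_vegfruits.txt",
]

s3 = boto3.client("s3")
for path in FILES:
    key = f"{PREFIX}/{Path(path).name}"
    s3.upload_file(path, BUCKET, key)
    print(f"uploaded {path} -> s3://{BUCKET}/{key}")
```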
Next JS
There are two selection buttons, and each one redirects to a different endpoint.
- Endpoints can be changed in src/frontend/ui/app/predict_app1 and src/frontend/ui/app/predict_app2
- npm run dev
Run aws configure and set the credentials.
Make sure only data/processed/sports and data/processed/vegfruits are present, and remove the other zip files.
dvc remote add -d myremote s3://mybucket-emlo-mumbai/session-18-data
dvc add data
Set the S3 URL and push
dvc push -r myremote
Training with GPU
- Build an image for training. The Dockerfile is in the root folder.

docker build -t emlo-18-train -f Dockerfile .

- Train a model inside Docker and push it to S3:

docker run --gpus=all \
  --name session-18-container \
  --shm-size=8g \
  -v "$(pwd):/workspace" \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION \
  -e AWS_REGION=$AWS_REGION \
  emlo-18-train \
  /bin/bash -c "dvc pull -r myremote && \
    dvc repro -f && \
    python src/backend/torch_local/transfer_mar.py -p dev"

- dvc pull -r myremote - pulls from S3 to local
- dvc repro -f - runs train and test
- python src/backend/torch_local/transfer_mar.py -p dev - transfers the required files (ONNX model, pt model, classes.json, and the accuracy information) to the S3 dev location or the given path in S3.

<debug>
Helpful for debugging containers locally:

docker run -it --gpus=all \
  --name session-18-container \
  --shm-size=8g \
  -v "$(pwd):/workspace" \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_DEFAULT_REGION=$AWS_REGION \
  -e AWS_REGION=$AWS_REGION \
  emlo-18-train \
  /bin/bash
</debug>
Wait patiently and check that every deletion is successful on the AWS CloudFormation stacks page before you shut down, because the deletion sometimes fails; something could still be running in the background and it may cost you a lot.
If you are triggering a spot instance manually with the persistent type, ensure that the spot request is cancelled manually and that the AWS instance is finally terminated.
Note: This repo also has the procedure for ArgoCD deployment repo https://github.com/ajithvcoder/emlo4-session-18-ajithvcoder-canary-argocd-kserve
It took our group two months of weekend effort to complete this, and it may cost you at least $20 in AWS charges if you are debugging and developing on AWS for this deployment alone. If you are just going to test it, it may cost $0.05 to $1 with GitHub Actions.
AWS install
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Provide credentials
aws configure
EKSCTL Install
# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
# (Optional) Verify checksum
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
Set the default ssh key locally
This default SSH key is used by AWS for the default SSH login.
ssh-keygen -t rsa -b 4096
Install kubectl for AWS EKS locally
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.32.0/2024-12-20/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
ArgoCD
curl -sSL -o argocd-linux-amd64 https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
sudo install -m 555 argocd-linux-amd64 /usr/local/bin/argocd
# rm argocd-linux-amd64
All the Docker images below are present here.
- Build and push docker images to AWS ECR
Model server
Name: model-onnx-server
- Used pytorch/torchserve-kfs:0.12.0 and installed the ONNX dependencies, so that KServe and Kubernetes inference services can also serve ONNX models while staying lightweight.
Web server
Name: web-server
docker build -t web-server -f Dockerfile.web-server .
UI server
Name: ui-server
docker build -t ui-server -f Dockerfile.ui-server .
Test all the above images locally on ports like 8080 or 9090 first and then change to port 80; otherwise it is hard to debug locally.
Use the commands below to debug the web server and the UI server:
- docker run -it --network my_network -v ./emlo4-s18/E2EMLOps/K8SDeploy/src/web-server:/opt/src -p9090:9090 web-server bash
- uvicorn server:app --host 0.0.0.0 --port 9090
ECR Repo
Cluster creation
Note: Comment out publicKeyPath: /root/.ssh/id_rsa.pub in the cluster file if you are running this locally, since it will then pick up the default SSH key from your machine.
Metric server config fix
- kubectl delete -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml --validate=false
- kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type='json' -p='[ { "op": "add", "path": "/spec/template/spec/hostNetwork", "value": true }, { "op": "replace", "path": "/spec/template/spec/containers/0/args", "value": [ "--cert-dir=/tmp", "--secure-port=4443", "--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname", "--kubelet-use-node-status-port", "--metric-resolution=15s", "--kubelet-insecure-tls" ] }, { "op": "replace", "path": "/spec/template/spec/containers/0/ports/0/containerPort", "value": 4443 } ]'
Reference:
KNative
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-crds.yaml
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-core.yaml
ISTIO
- kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.16.0/istio.yaml
- kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.16.0/net-istio.yaml
Patching
kubectl patch configmap/config-domain \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"emlo.tsai":""}}'
- kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-hpa.yaml
- kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml
Wait and check if everything is up and running
kubectl get all -n cert-manager
KServe
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.14.1/kserve.yaml
Wait and check if all pods are up
- kubectl get all -n kserve
- kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.14.1/kserve-cluster-resources.yaml
S3 Access creation
eksctl create iamserviceaccount --cluster=basic-cluster-1 --name=s3-read-only --attach-policy-arn=arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess --override-existing-serviceaccounts --region ap-south-1 --approve
- kubectl apply -f s3-secret.yaml
- kubectl patch serviceaccount s3-read-only -p '{"secrets": [{"name": "s3-secret"}]}'
Dashboard, ALB, Charts
Kubernetes Dashboard
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard
kubectl label namespace default istio-injection=enabled
ALB
eksctl create iamserviceaccount \
--cluster=basic-cluster-1 \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--attach-policy-arn=arn:aws:iam::306093656765:policy/AWSLoadBalancerControllerIAMPolicy \
--override-existing-serviceaccounts \
--region ap-south-1 \
--approve
EKS Charts
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=basic-cluster-1 --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller
Test URL
kubectl get pods,svc -n istio-system
The URL shown by the above command can also be seen on the Load Balancers page in AWS.
Test if sports-classifier works fine with the sports S3 models:

kubectl apply -f sports-classifier.yaml

python test_kserve_sports_2.py

Check that everything works and then delete it; we still need to set up Prometheus and Grafana.
Ingress details
kubectl get isvc
kubectl get svc -n istio-system
Delete classifier after testing
kubectl delete -f sports-classifier.yaml
Up to this point you can check whether everything works with python test_kserve_sports_2.py (a minimal sketch of such a request is shown below).
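test_kserve_sports_2.py essentially sends a prediction request through the Istio ingress with the InferenceService host header. The ingress URL, service host, model name, and payload format below are assumptions for illustration:

```python
# Minimal sketch: call the KServe InferenceService through the Istio ingress.
import base64

import requests

INGRESS_URL = "http://<istio-ingress-loadbalancer>"       # from `kubectl get svc -n istio-system`
SERVICE_HOST = "sports-classifier.default.emlo.tsai"      # from `kubectl get isvc` (assumed)
MODEL_NAME = "sports-classifier"                          # assumed model name

with open("sample_sports.jpg", "rb") as f:
    payload = {"instances": [{"data": base64.b64encode(f.read()).decode()}]}

resp = requests.post(
    f"{INGRESS_URL}/v1/models/{MODEL_NAME}:predict",
    json=payload,
    headers={"Host": SERVICE_HOST},
    timeout=30,
)
print(resp.status_code, resp.json())
```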
Prometheus
cd other-setup
git clone --branch release-0.14 https://github.com/kserve/kserve.git
cd kserve
kubectl apply -k docs/samples/metrics-and-monitoring/prometheus-operator
kubectl wait --for condition=established --timeout=120s crd/prometheuses.monitoring.coreos.com
kubectl wait --for condition=established --timeout=120s crd/servicemonitors.monitoring.coreos.com
kubectl apply -k docs/samples/metrics-and-monitoring/prometheus
cd ..
kubectl patch configmaps -n knative-serving config-deployment --patch-file qpext_image_patch.yaml
Optional step: set the maximum number of nodes, because if you send more requests and no maximum is set, the cluster may scale out further than intended.
<debug>
eksctl scale nodegroup --cluster=basic-cluster --nodes=6 ng-spot-3 --nodes-max=6
eksctl get nodegroup --cluster basic-cluster --region ap-south-1 --name ng-spot-3
</debug>
kubectl port-forward service/prometheus-operated -n kfserving-monitoring 9090:9090
Grafana
kubectl create namespace grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana --namespace grafana --version 8.8.4
kubectl get secret --namespace grafana grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
kubectl port-forward svc/grafana 3000:80 -n grafana
Setup dashboard for grafana
- Go to Connections -> Add data source -> Prometheus -> add this Prometheus URL: http://prometheus-operated.kfserving-monitoring.svc.cluster.local:9090 -> Save & test
- Go to Dashboards -> New -> Import -> download the JSON file from https://grafana.com/grafana/dashboards/18032-knative-serving-revision-http-requests/ -> upload the JSON in the specified place
verify
kubectl get isvc
Now clone this repo and get inside it
You can also perform a canary deployment with this same setup by adding `canary: 30` in the canary-argocd-kserve repo.
ArgoCD setup
- kubectl create namespace argocd
- kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
Get Argocd password for login
argocd admin initial-password -n argocd
Check if you are able to access the UI. In Codespaces I was not able to port-forward and access the ArgoCD UI, but locally it works.
kubectl port-forward svc/argocd-server -n argocd 8080:443
Argo CD deployment
Note: I have used the canary-argocd-kserve repo for deploying the Argo CD apps. Please refer to https://github.com/ajithvcoder/emlo4-session-18-ajithvcoder-canary-argocd-kserve for the Argo CD repo structure.
- Have an s3-secret.yaml file at argo-apps/s3-secret.yaml and update it with your AWS credentials, i.e. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
apiVersion: v1
kind: Secret
metadata:
  name: s3creds
  annotations:
    serving.kserve.io/s3-endpoint: s3.ap-south-1.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
    serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
    serving.kserve.io/s3-region: "ap-south-1"
    serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
type: Opaque
stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
  AWS_ACCESS_KEY_ID: AKXXXXXXXXXXXXXXXXXXXXX
  AWS_SECRET_ACCESS_KEY: "RQHBUNBSJNINQONUKNUKXXXXXX+XQIWOW"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-read-only
secrets:
- name: s3creds
- Create the Argo CD repo before you start and update the repo URL in the argo-apps/models.yaml file.
Now commit the load balancer URL to the emlo4-s18/emlo4-session-18-ajithvcoder-canary-argocd-kserve/fastapi-helm/templates/model-server.cm.yml file.
Change both the model-1 and model-2 URLs.
Deploy Command
kubectl apply -f argo-apps
<debug>
Delete argocd deployments
kubectl get app -n argocd
kubectl patch app fastapi-helm -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd
kubectl delete app fastapi-helm -n argocd
</debug>
Verify the pods
kubectl get pods
Load testing Veg fruits model
python3 utils/test_load_vegfruits_2.py
Load testing Sports model
python3 utils/test_load_sports_2.py
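These load-test scripts boil down to firing many concurrent requests and recording the latency and success rate. A minimal sketch of that idea follows; the URL, host header, payload, and request counts are assumptions:

```python
# Minimal load-test sketch: concurrent requests with latency and success-rate stats.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://<istio-ingress-loadbalancer>/v1/models/vegfruits-classifier:predict"  # assumed
HEADERS = {"Host": "vegfruits-classifier.default.emlo.tsai"}                        # assumed
PAYLOAD = {"instances": [{"data": "<base64-encoded image>"}]}                       # assumed
N_REQUESTS, CONCURRENCY = 200, 20

def call(_):
    start = time.perf_counter()
    ok = requests.post(URL, json=PAYLOAD, headers=HEADERS, timeout=30).status_code == 200
    return ok, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(call, range(N_REQUESTS)))

latencies = [t for _, t in results]
success_rate = sum(ok for ok, _ in results) / N_REQUESTS
print(f"success rate: {success_rate:.1%}")
print(f"p50 latency: {statistics.median(latencies):.3f}s, max: {max(latencies):.3f}s")
```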
Delete argocd deployments
Verify app name
kubectl get app -n argocd
Delete cascade
kubectl patch app model-deployments -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd
kubectl delete app model-deployments -n argocd
Deletion of cluster
eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction
Wait patiently and check that every deletion is successful on the AWS CloudFormation stacks page before you shut down, because the deletion sometimes fails; something could still be running in the background and it may cost you a lot.
If you are triggering a spot instance manually with the persistent type, ensure that the spot request is cancelled manually and that the AWS instance is finally terminated.
Ports establishment
Argo CD UI
Pods getting initialized
Load balancers after deployment
- For Model serving
- For UI serving
Change ALB URLs
*Note: In the GitHub Actions deployment this is done automatically by GitHub bots.
Argo CD Commit
Pod running
UI Prediction
Post request
Grafana Dashboards
Load test - Scale up and down
- The "requests per second" metric limit is 5, and the pod scales up when it goes above 5.
- Min and max pod scaling are 1 and 3.
Load request to model 1
Scale up
Scale down
Load request to model 2
Response time - Load test
Sports Model
Vegfruits Model
Two Workflows

- On push to the main branch, the Main Deployment workflow gets triggered, which calls 01A_Deployment_Train_And_Store.yaml and 01B_Deployment_ArgoCD_And_Promote.yaml.

01A Deployment_Train_And_Store
a. Internally builds the current code with its requirements as an image and pushes it to the a18/torch-lightning-train repo in ECR.
b. Triggers a spot g4dn.xlarge (16 GB) GPU instance with a 128 GB volume and the PyTorch 2.6 AMI.
c. Pulls the image from the ECR repo, downloads the dataset, trains, and pushes the model to the s3-stage location.

01B Deployment_ArgoCD_And_Promote
a. Triggers an EC2 t3a.xlarge as the host instance.
b. Sets up a cluster with 3 spot and 1 dedicated instance.
c. Performs all the necessary setup for KServe, ALB, Argo CD, Prometheus, and Grafana. KServe takes care of HPA scaling as it has it built in.
d. Fetches the model from s3-stage and deploys it.
e. Performs a load test on both models, and if the success rate is more than 90% it moves the model to s3-prod (see the promotion sketch after this list).

- On pull request to the main branch, the PR Accuracy Test workflow gets triggered, which calls 01D_Deployment_PR_Train_And_Store.yaml.

a. Triggers a spot g4dn.xlarge (16 GB) GPU instance with a 128 GB volume and the PyTorch 2.6 AMI.
b. Pulls the image from the ECR repo, downloads the dataset, trains, and pushes the model to the s3-dev location.
c. Checks whether the model achieves a higher accuracy than s3-prod; if so, makes a comment.
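Promotion from s3-stage to s3-prod is essentially a server-side copy of the model artifacts. A minimal sketch follows; the bucket and prefixes are assumptions:

```python
# Minimal promotion sketch: copy every object under the stage prefix to the prod prefix.
import boto3

BUCKET = "mybucket-emlo-mumbai"   # assumed bucket
STAGE_PREFIX = "sports-stage/"    # assumed prefixes
PROD_PREFIX = "sports-prod/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=STAGE_PREFIX):
    for obj in page.get("Contents", []):
        dest_key = PROD_PREFIX + obj["Key"][len(STAGE_PREFIX):]
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            Key=dest_key,
        )
        print(f"promoted {obj['Key']} -> {dest_key}")
```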
There are 3 repos
- a18/ui-server - Next JS UI
- a18/web-server - Fast API web server
- model-onnx-server - For serving Kserve + HPA model
*Note: After training, for deployment 1 we save to {project_name}-dev, {project_name}-stage, or {project_name}-prod in the S3 bucket depending on the use case. For deployment 2 and deployment 3 we use {project_name}-prod as the model source for deployment.
Main Deployment workflow

- Triggered on push to the main branch
- Trains the model and pushes it to the S3 stage folder
- Sets up Argo CD and deploys the two models with HPA
- Fetches the load balancer URL and commits it to the Argo CD repo

Live Model

- Deploys the Argo CD app
- After a successful load test, promotes the model from s3-stage to s3-prod - (screenshot)

Model 1
Model 2

- Comments the load test results on the tested commit - (screenshot)
PR Accuracy Test workflow

When a new PR is made, this workflow starts to execute. It trains and then pushes the model to S3.

- Overall actions workflow
- Builds the current code and pushes it to the ECR repo
- Starts a GPU spot instance
- Pulls the dataset, trains a model, and compares its accuracy with the prod model. An output_{project_name}.txt is generated after every training run and travels with the model wherever it is copied, so it can be used for the comparison instead of re-evaluating every time (see the comparison sketch after this list).
- Stops the EC2 runner
- On the pull request, generates a comment
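The accuracy comparison against the prod model can then be done just by reading the two output_{project_name}.txt files. The sketch below is an illustration; the bucket, object key, and file format (a "test_acc: ..." line) are assumptions:

```python
# Minimal comparison sketch: read the accuracy recorded in output_{project}.txt
# for the freshly trained model and for the prod model, then decide.
import re

import boto3

BUCKET = "mybucket-emlo-mumbai"               # assumed bucket
PROD_KEY = "sports-prod/output_sports.txt"    # assumed key

def parse_accuracy(text: str) -> float:
    # Assumes the file contains a line like "test_acc: 0.93".
    return float(re.search(r"test_acc[:=]\s*([0-9.]+)", text).group(1))

with open("output_sports.txt") as f:
    new_acc = parse_accuracy(f.read())

prod_obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=PROD_KEY)
prod_acc = parse_accuracy(prod_obj["Body"].read().decode())

print(f"new={new_acc:.4f} prod={prod_acc:.4f}")
if new_acc > prod_acc:
    print("New model beats prod - comment on the PR / promote.")
```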
(Kindly watch at 2x as it is a long video. The results are at the end of the video.)
*Note: After training, for deployment 1 we save to {project_name}-dev, {project_name}-stage, or {project_name}-prod in the S3 bucket depending on the use case. For deployment 2 and deployment 3 we use {project_name}-prod as the model source for deployment.
Files are in the gradio_deploy/ folder.

- S3 folder where the model is present
- Develop a Gradio app
- Create a Space on Hugging Face
- Create a write token, since we need to push the model and code to Hugging Face through GitHub Actions
- Add the AWS secrets to the Hugging Face Space

Set the Hugging Face credentials in the environment.
Now push the files to the Hugging Face Space with the huggingface_hub package.
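The push via huggingface_hub looks roughly like this sketch; the Space id, local folder, and HF_TOKEN environment variable are assumptions:

```python
# Minimal sketch: push the Gradio app files to the Hugging Face Space.
import os

from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])  # write token created earlier (assumed env var)
api.upload_folder(
    folder_path="gradio_deploy",         # local folder with app.py, requirements.txt, ...
    repo_id="your-username/your-space",  # hypothetical Space id
    repo_type="space",
)
```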
Local testing
python app.py
app.py is a Gradio app that fetches the model files from S3 and uses them for prediction (a minimal sketch is shown below).
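A minimal sketch of such an app follows; the bucket, object keys, class-mapping format, and preprocessing are assumptions:

```python
# Minimal Gradio sketch: download the ONNX model from S3 and serve predictions.
import json
import os

import boto3
import gradio as gr
import numpy as np
import onnxruntime as ort
from PIL import Image

BUCKET = "mybucket-emlo-mumbai"  # assumed bucket
os.makedirs("s3_files", exist_ok=True)
s3 = boto3.client("s3")
s3.download_file(BUCKET, "sports-prod/sports.onnx", "s3_files/sports.onnx")                    # assumed keys
s3.download_file(BUCKET, "sports-prod/index_to_name.json", "s3_files/index_to_name.json")

session = ort.InferenceSession("s3_files/sports.onnx")
with open("s3_files/index_to_name.json") as f:
    idx_to_name = json.load(f)  # assumed mapping: {"0": "air hockey", ...}

def predict(image: Image.Image) -> str:
    x = np.asarray(image.convert("RGB").resize((224, 224)), dtype=np.float32)
    x = x.transpose(2, 0, 1)[None] / 255.0
    logits = session.run(None, {"input": x})[0][0]
    return idx_to_name[str(int(logits.argmax()))]

gr.Interface(fn=predict, inputs=gr.Image(type="pil"), outputs="label").launch()
```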
Hugging face space repo
Hugging face Repo Deployments
Github actions deployment
Live Deployed D02 Hugging Face Link
Note: Kindly check AWS CloudFormation to verify that everything is deleted.
*Note: After training, for deployment 1 we save to {project_name}-dev, {project_name}-stage, or {project_name}-prod in the S3 bucket depending on the use case. For deployment 2 and deployment 3 we use {project_name}-prod as the model source for deployment.
Explanation: First, a local deployment of Gradio is done for the sports classifier and the veg-fruits classifier and tested with Docker. Production-grade models are fetched from S3 and kept in the "s3_files" folder for both deployments. They are then added to the Docker container, and the CDK package is used to push the image to ECR and create the Lambda service stack.
Files are in the LambdaFn/ folder.
NPM and Node
sudo apt install npm
sudo npm install -g aws-cdk
cdk --version
CDK using Python
Note: pwd = ParentDirectory
source .venv/bin/activate
python -m pip install -r aws-req.txt
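The CDK app that creates the Lambda stack is roughly the sketch below. The construct names, memory size, timeout, and public function URL are assumptions; it builds the image from the Dockerfile.lambdafn used in the Build step:

```python
# Minimal CDK sketch: package the Gradio app as a Docker-image Lambda with a public URL.
from aws_cdk import App, Duration, Stack, aws_lambda as _lambda
from constructs import Construct

class GradioLambdaStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Lambda packaged as the Docker image built from Dockerfile.lambdafn.
        fn = _lambda.DockerImageFunction(
            self,
            "GradioFn",
            code=_lambda.DockerImageCode.from_image_asset(".", file="Dockerfile.lambdafn"),
            memory_size=3008,               # assumed memory
            timeout=Duration.seconds(300),  # assumed timeout
        )
        # Expose a public URL for the demo (assumption: no auth needed).
        fn.add_function_url(auth_type=_lambda.FunctionUrlAuthType.NONE)

app = App()
GradioLambdaStack(app, "GradioLambdaStack")
app.synth()
```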
Local test for gradio
python app.py
Deploy in AWS

- cdk bootstrap aws://${CDK_DEFAULT_ACCOUNT}/${CDK_DEFAULT_REGION}
- cdk deploy --require-approval=never
- cdk destroy - destroys all the resources created
Kindly check AWS CloudFormation to verify that everything is deleted.
Build

- docker build -t lambdafn -f Dockerfile.lambdafn . --no-cache
- docker run -v /workspaces/E2EMLOps/LambdaFn:/var/task -it -p 8080:8080 lambdafn bash
<debug>
1. Use docker builder prune --all --force to clean the Docker builds locally.
2. Don't use COPY . . — the cdk.out folder also stores the Docker images and can grow to around 4 GB+, so copy only the required files.
</debug>
Github actions deployment
- Files are downloaded from the s3-prod folder and added to the Docker container; the CDK package is used to push the image to ECR and create the Lambda service.
- Sports model prediction
- Vegetable-fruits model prediction
- Learnt about making automated deployments with GitHub Actions.
- If we build a GPU image from the GitHub runner it throws the error below, so I added new privileges to use the host GPU.

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")

- I learnt a lot about GitHub Actions and Kubernetes services over the past few assignments, and I learnt that it is good to spend some money on AWS services, similar to how we pay fees for courses. We pay 20k or 50k for some courses but hesitate to spend 500 or 1000 rupees on practical work; doing so has helped me. I spent 10k on all 18 assignments, and it has helped me handle Kubernetes issues at my office.
*Most of the other pieces were already done in previous assignments; this assignment is a combination of everything.
PyTorch Lightning - a lightweight PyTorch wrapper for high-performance AI research. Think of it as a framework for organizing your PyTorch code.
Hydra - a framework for elegantly configuring complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.
DVC - A tool designed to handle large datasets and machine learning models in a version-controlled workflow
Tensorboard|wandb - TensorBoard is a tool that provides visualization and debugging capabilities for TensorFlow and PyTorch experiments. It’s a popular choice for monitoring machine learning training processes in real time.
AWS|EC2|S3|Lambda|ECR - AWS Elastic Compute Cloud (EC2) is a service that provides scalable virtual computing resources in the cloud.
Docker - A platform for creating, deploying, and managing lightweight, portable, and scalable containers.
FastAPI|Gradio - FastAPI is a modern Python framework for building APIs; Gradio is a Python library for building simple, interactive web interfaces for machine learning models.
Nextjs - Frontend FrameWork
K8s|KNative|Kserve|Istio|ArgoCD - Kubernetes on AWS (EKS) with Knative, KServe, and Istio for model serving, and ArgoCD for GitOps-based deployment.
Prometheus|Grafana - observability
Group Members
- Ajith Kumar V (myself)
- Hema M
- Muthukamalan
- Abhiya Gupta