Commit 85b0676

update readme
1 parent 4da1e31 commit 85b0676

1 file changed

+89
-85
lines changed

README.md

Lines changed: 89 additions & 85 deletions
@@ -11,29 +11,35 @@ Following are the deployments made in this Capstone assignment
1111
### Contents
1212

1313
- [Requirements](#requirements)
14-
- [Deployments](#Deployments)
15-
- [Dataset]
16-
- [Model Development]
17-
- [DVC setup with AWS S3]
18-
- [Docker command for train and test]
19-
- [Deployment 01 - Kubernetes]
20-
- [Manual setup]
21-
- [Architecture Diagram]
22-
- [Installation]
23-
- [Cluster creation and configuration]
24-
- [ArgoCD Deployment]
25-
- [Load testing with promethes and Grafana]
14+
- [Deployments](#deployments)
15+
- [Dataset](#dataset)
16+
- [Model Development](#model-development)
17+
- [DVC setup with AWS S3](#dvc-setup-with-aws-s3)
18+
- [Docker command for train and test](#docker-command-for-train-and-test)
19+
- [Deployment 01 - Kubernetes](#deployment-01---kubernetes)
20+
- [Architecture Diagram](#architecture-diagram)
21+
- [Manual setup](#manual-setup)
22+
- [D01 Installation](#d01-installation)
23+
- [Docker images to ECR](#docker-images-to-ecr)
24+
- [Cluster creation and configuration](#cluster-creation-and-configuration)
25+
- [ArgoCD Deployment](#argocd-deployment)
26+
    - [Load testing with Prometheus and Grafana](#load-testing-with-promethes-and-grafana)
2627
- [Deletion Procedure](#deletion-procedure)
27-
- [Github actions - Automaing Manual setup End to End]
28-
- [Results Screenshots - D01](#results-screenshots)
29-
- [Video Link]
28+
- [Results Screenshots - D01 Manual](#results-screenshots---d01-manual)
29+
    - [Github actions - Automating Manual setup End to End - D01](#github-actions---automaing-manual-setup-end-to-end---d01)
30+
- [Results Screenshots - 01A and 01B Deployment Main](#results-screenshots---01a-and-01b-deployment-main)
31+
- [Github actions workflow Link - D01A-D01B](#github-actions-workflow-link---d01a-d01b)
32+
- [Results Screenshots - D01C On PR Request](#results-screenshots---d01c-on-pr-request)
33+
- [Github actions workflow Link - D01C](#github-actions-workflow-link---d01c)
34+
- [D01 - Deployment - Video Link](#d01---deployment---video-link)
3035
- [Deployment 02]
3136
- [Results Screenshots - D02](#results-screenshots)
32-
- [Deployment 03]
33-
- Installations
34-
- Usage
35-
- Docker image creation
36-
- [Results Screenshots - D03](#results-screenshots)
37+
- [Deployment 03 AWS Lambda](#deployment-03---aws-lambda)
38+
- [D03 - Installations](#d03---installations)
39+
    - [D03 - Usage](#d03---usage)
40+
- [D03 - Docker Image Creation](#d03---docker-image-creation)
41+
- [Results Screenshots - D03](#results-screenshots---d03)
42+
- [Github actions workflow Link - D03](#github-actions-workflow-link---d03)
3743
- [Learnings](#learnings)
3844
- [Technologies Used]
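
The table-of-contents links above only resolve if each anchor matches the slug GitHub derives from the heading text. The sketch below approximates that rule (lowercase, drop punctuation other than hyphens and underscores, turn spaces into hyphens) so entries can be checked before committing. The helper name `github_anchor` is hypothetical, and the rule is an approximation that ignores edge cases such as duplicate headings.

```python
import re


def github_anchor(heading: str) -> str:
    """Approximate the anchor GitHub generates for a markdown heading.

    Assumption: lowercase the text, strip every character that is not a
    word character, hyphen, or space, then replace each space with a hyphen.
    Duplicate-heading suffixes (-1, -2, ...) are not handled here.
    """
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)
    return "#" + text.replace(" ", "-")


def toc_entry(heading: str) -> str:
    """Build a markdown list item that links to the heading's anchor."""
    return f"- [{heading}]({github_anchor(heading)})"


if __name__ == "__main__":
    print(toc_entry("Deployment 01 - Kubernetes"))
```

Note that a " - " separator in a heading becomes three hyphens in the anchor, which is why entries such as `#deployment-01---kubernetes` look the way they do.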
3945

@@ -68,40 +74,6 @@ Kubernetes manifest files and Helm charts.
6874
- Screenshots and explanations of your pipeline in action.
6975
Performance metrics, including latency and stress test results.
7076

71-
### Pending for Deployment 1
72-
Code
73-
- Torchserve not giving proper predictions even for true class, check it - look later - check if its the preprocessing issue - done
74-
- Transfer_mar should transfer .pt file and accuracy text file to s3. it can be used for gradio, lambda and accuracy checking - done
75-
- Update Workflow to train on pull request and store to s3-dev and compare with prod model accuracy and comment in github actions - done
76-
- Update Workflow to train on push request and store to s3-stage for deployment
77-
- After stress test move from stage to prod
78-
- Comment on the commit with cml for stress test results
79-
- in actions change the workflow comment to PR
80-
- good one - ecr image May 17, 2025, 00:03:15 (UTC+05.5) - take this and move forward.
81-
82-
Note: If we build a gpu image from github runner then its throwing below error, so i have prebuilt image
83-
84-
```
85-
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
86-
warnings.warn("Can't initialize NVML")
87-
```
88-
if no thing works then use the model-onnx-server image
89-
### Pending for Deployment 3
90-
- lambda
91-
92-
todo: (17/05/2025)
93-
1. add manual screenshot - done
94-
2. check in github actions once with debug yaml and then add response time graph in load test in github actions
95-
3. Do model comparision for 01D
96-
3. do video for github actions 01A and 01B explanation and merger with manual video
97-
4. add github actions screenshot
98-
5. add architecture diagram
99-
100-
101-
Docs
102-
- Architecture diagram
103-
- Screenshots of deployment and video
104-
10577
Explanation: Deployment 01
10678
Architecture diagram
10779
dvc setup and pull - dataset
@@ -231,7 +203,7 @@ Make sure `data/processed/sports` and `data/processed/vegfruits` are only prese
231203
Set the S3 URL and push
232204
- `dvc push -r myremote`
233205

234-
### Docker command
206+
### Docker command for train and test
235207

236208
**Training with GPU**
237209

@@ -295,13 +267,16 @@ Note: This repo also has the procedure for ArgoCD deployment repo https://github
295267
296268
It took two months of weekend effort to complete this as a group. Expect at least $20 in AWS costs if you debug and develop on AWS for this deployment alone; if you only want to test it with GitHub Actions, it may cost between $0.05 and $1.
297269
270+
298271
### Architecture Diagram
299272
300-
- TODO
273+
![](./assets/deployment-01-kubernetes/snap_architecture_01.png)
274+
275+
![](./assets/deployment-01-kubernetes/snap_architecture_02.png)
301276
302277
### Manual Setup
303278
304-
### Installation
279+
### D01 Installation
305280
306281
**AWS install**
307282
@@ -713,7 +688,28 @@ Load testing Sports model
713688
714689
- `python3 utils/test_load_sports_2.py`
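
The contents of the load-test script are not shown here, so the following is only a minimal sketch of how the response times collected during such a run could be summarised into the latency figures the README reports. The function name `summarize_latencies` and the choice of a nearest-rank 95th percentile are assumptions, not the repository's implementation.

```python
import statistics


def summarize_latencies(samples_ms):
    """Summarise response times (in milliseconds) from a load-test run.

    Hypothetical helper: uses the nearest-rank method for the 95th
    percentile, computed with integer arithmetic to avoid float rounding.
    """
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank p95: ceil(0.95 * n), done as integer math.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "count": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    print(summarize_latencies([120, 95, 310, 180]))
```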
715690
716-
### D01 - Result Screenshots
691+
### Deletion Procedure
692+
693+
**Delete argocd deployments**
694+
695+
Verify app name
696+
- `kubectl get app -n argocd`
697+
698+
Delete cascade
699+
- `kubectl patch app model-deployments -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd`
700+
- `kubectl delete app model-deployments -n argocd`
701+
702+
**Deletion of cluster**
703+
704+
- `eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction`
705+
706+
**Wait patiently and confirm on the AWS CloudFormation stacks page that every deletion succeeded before closing the system. Deletion sometimes
707+
fails, leaving resources running in the background that can become costly.**
708+
709+
**If you launch a spot instance manually with the `persistent` request type, make sure the spot request is cancelled manually
710+
and the AWS instance is finally terminated.**
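
The deletion steps above have to run in a fixed order: verify the app, add the cascade finalizer, delete the app, then delete the cluster. The sketch below collects the same commands into one list so the order is explicit. The function names and the dry-run default are assumptions for illustration; the commands themselves are copied from the steps above.

```python
import json
import shlex
import subprocess

ARGOCD_FINALIZER = "resources-finalizer.argocd.argoproj.io"


def deletion_commands(app_name="model-deployments", namespace="argocd",
                      cluster_file="eks-cluster.yaml"):
    """Return the teardown commands in the order they should run."""
    patch = json.dumps({"metadata": {"finalizers": [ARGOCD_FINALIZER]}})
    return [
        # 1. Verify the ArgoCD application name.
        ["kubectl", "get", "app", "-n", namespace],
        # 2. Add the finalizer so deleting the app cascades to its resources.
        ["kubectl", "patch", "app", app_name, "-p", patch,
         "--type", "merge", "-n", namespace],
        # 3. Delete the ArgoCD application.
        ["kubectl", "delete", "app", app_name, "-n", namespace],
        # 4. Delete the EKS cluster.
        ["eksctl", "delete", "cluster", "-f", cluster_file,
         "--disable-nodegroup-eviction"],
    ]


def run_deletion(dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in deletion_commands():
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_deletion(dry_run=True)
```

Even after a scripted teardown, the CloudFormation stacks page still needs a manual check, since a failed stack deletion does not always surface as a command error.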
711+
712+
### Results Screenshots - D01 Manual
717713
718714
**Ports establishment**
719715
@@ -804,29 +800,9 @@ Vegfruits Model
804800
805801
![](./assets/deployment-01-kubernetes/manual/response_times_vegfruits.png)
806802
807-
### Deletion Procedure
808803
809804
810-
**Delete argocd deployments**
811-
812-
Verify app name
813-
- `kubectl get app -n argocd`
814-
815-
Delete cascade
816-
- `kubectl patch app model-deployments -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd`
817-
- `kubectl delete app model-deployments -n argocd`
818-
819-
**Deletion of cluster**
820-
821-
- `eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction`
822-
823-
**Wait paitently see all deletion is successfull in aws cloud formation stack page and then close the system because some times
824-
the deletion gets failed so at backend something would be running and it may cost you high**
825-
826-
**If you are triggering a spot instance manually with `peresistent` type ensure that both the spot request is cancelled manually
827-
and the AWS instance is terminated finally**
828-
829-
### D01 - Github Actions
805+
### Github actions - Automaing Manual setup End to End - D01
830806
831807
Two Workflows
832808
@@ -873,12 +849,13 @@ There are 3 repos
873849
874850
![](./assets/deployment-01-kubernetes/github-actions/github_secrets.png)
875851
852+
### Results Screenshots - 01A and 01B Deployment Main
876853
877854
**Main Deployment workflow**
878855
879856
1. Triggered on push to the main branch
880857
881-
2. Train and push to stage - (screenshot todo)
858+
2. Train and push to stage
882859
883860
3. Sets up Argo CD and deploys two models with HPA
884861
@@ -904,7 +881,13 @@ There are 3 repos
904881
905882
5. Comments the load test results on the tested commit - (screenshot)
906883
907-
![](assets/deployment-01-kubernetes/github-actions/01A/sports_classifier_comments.png)
884+
![](assets/deployment-01-kubernetes/github-actions/01A/snap_graph_results_and_comment.png)
885+
886+
### Github actions workflow Link - D01A-D01B
887+
888+
- [D01 - Kubernetes - Main - Final workflow Result](https://github.com/ajithvcoder/E2EMLOps/actions/runs/15092736500)
889+
890+
### Results Screenshots - D01C On PR Request
908891
909892
**PR Accuracy Test workflow**
910893
@@ -930,9 +913,19 @@ There are 3 repos
930913
931914
![](assets/deployment-01-kubernetes/github-actions/01B/01b_stops_ec2_runner.png)
932915
933-
6. On PR Request generates a comment (screenshot - todo - change)
916+
6. Generates a comment on the pull request
917+
918+
![](assets/deployment-01-kubernetes/github-actions/01B/snap_dp01b_comment.png)
919+
920+
![](assets/deployment-01-kubernetes/github-actions/01B/snap_acc_comparision.png)
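
The PR workflow compares the accuracy of the newly trained model against the production model and posts the result as a comment. The workflow's actual code is not shown here, so the following is only a sketch of how such a comment body could be assembled. The function name, the table layout, and the wording of the verdict are all assumptions.

```python
def build_accuracy_comment(model_name, dev_accuracy, prod_accuracy):
    """Build a markdown comment comparing PR (dev) and prod accuracy.

    Hypothetical helper: accuracies are fractions in [0, 1]; the PR model
    counts as an improvement only if it is strictly more accurate.
    """
    delta = dev_accuracy - prod_accuracy
    verdict = "improves on" if delta > 0 else "does not improve on"
    lines = [
        f"### Accuracy comparison: {model_name}",
        "",
        "| Model | Accuracy |",
        "| --- | --- |",
        f"| dev (PR) | {dev_accuracy:.4f} |",
        f"| prod | {prod_accuracy:.4f} |",
        "",
        f"The PR model {verdict} the production model (delta {delta:+.4f}).",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_accuracy_comment("sports", 0.91, 0.89))
```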
921+
922+
### Github actions workflow Link - D01C
923+
924+
- [D01C - Kubernetes - On PR - Final workflow Result](https://github.com/ajithvcoder/E2EMLOps/actions/runs/15092490112)
925+
926+
### D01 - Deployment - Video Link
934927
935-
![](assets/deployment-01-kubernetes/github-actions/01B/pr_new.png)
928+
- todo
936929
937930
## Deployment 03 - AWS Lambda
938931
@@ -995,7 +988,7 @@ Kindly check "AWS Cloud formations" to verify that everything is deleted
995988
</debug>
996989
```
997990
998-
#### D03 - Results Screenshots
991+
### Results Screenshots - D03
999992
1000993
**Github actions deployment**
1001994
@@ -1012,12 +1005,23 @@ CDK package is used to push the image to ECR and lmabda service
10121005
10131006
![](./assets/deployment-03-lambda/snap_banana.png)
10141007
1008+
### Github actions workflow Link - D03
1009+
1010+
- [D03 - Lambda - Final run workflow link](https://github.com/ajithvcoder/E2EMLOps/actions/runs/15085943040)
1011+
10151012
10161013
### Learnings
10171014
10181015
1. Learnt about making automated deployment with github actions
10191016
1020-
*Mostly other stuffs are already done in previous assignments and its combination of everything
1017+
2. Building a GPU image on a GitHub runner throws the error below, so I used a prebuilt image
1018+
1019+
```
1020+
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
1021+
warnings.warn("Can't initialize NVML")
1022+
```
1023+
1024+
*Most of the other work was already done in previous assignments; this assignment combines everything.
10211025
10221026
## Technologies Used
10231027
