@@ -11,29 +11,35 @@ Following are the deployments made in this Capstone assignment
1111### Contents
1212
1313- [ Requirements] ( #requirements )
14- - [ Deployments] ( #Deployments )
15- - [ Dataset]
16- - [ Model Development]
17- - [ DVC setup with AWS S3]
18- - [ Docker command for train and test]
19- - [ Deployment 01 - Kubernetes]
20- - [ Manual setup]
21- - [ Architecture Diagram]
22- - [ Installation]
23- - [ Cluster creation and configuration]
24- - [ ArgoCD Deployment]
25- - [ Load testing with promethes and Grafana]
14+ - [ Deployments] ( #deployments )
15+ - [ Dataset] ( #dataset )
16+ - [ Model Development] ( #model-development )
17+ - [ DVC setup with AWS S3] ( #dvc-setup-with-aws-s3 )
18+ - [ Docker command for train and test] ( #docker-command-for-train-and-test )
19+ - [ Deployment 01 - Kubernetes] ( #deployment-01---kubernetes )
20+ - [ Architecture Diagram] ( #architecture-diagram )
21+ - [ Manual setup] ( #manual-setup )
22+ - [ D01 Installation] ( #d01-installation )
23+ - [ Docker images to ECR] ( #docker-images-to-ecr )
24+ - [ Cluster creation and configuration] ( #cluster-creation-and-configuration )
25+ - [ ArgoCD Deployment] ( #argocd-deployment )
26+ - [ Load testing with promethes and Grafana] ( #load-testing-with-promethes-and-grafana )
2627 - [ Deletion Procedure] ( #deletion-procedure )
27- - [ Github actions - Automaing Manual setup End to End]
28- - [ Results Screenshots - D01] ( #results-screenshots )
29- - [ Video Link]
28+ - [ Results Screenshots - D01 Manual] ( #results-screenshots---d01-manual )
29+ - [ Github actions - Automaing Manual setup End to End - D01] ( #github-actions---automaing-manual-setup-end-to-end---d01 )
30+ - [ Results Screenshots - 01A and 01B Deployment Main] ( #results-screenshots---01a-and-01b-deployment-main )
31+ - [ Github actions workflow Link - D01A-D01B] ( #github-actions-workflow-link---d01a-d01b )
32+ - [ Results Screenshots - D01C On PR Request] ( #results-screenshots---d01c-on-pr-request )
33+ - [ Github actions workflow Link - D01C] ( #github-actions-workflow-link---d01c )
34+ - [ D01 - Deployment - Video Link] ( #d01---deployment---video-link )
3035 - [ Deployment 02]
3136 - [ Results Screenshots - D02] ( #results-screenshots )
32- - [ Deployment 03]
33- - Installations
34- - Usage
35- - Docker image creation
36- - [ Results Screenshots - D03] ( #results-screenshots )
37+ - [ Deployment 03 AWS Lambda] ( #deployment-03---aws-lambda )
38+ - [ D03 - Installations] ( #d03---installations )
39+ - [ D03 - Usage] ( #d03---usage )
40+ - [ D03 - Docker Image Creation] ( #d03---docker-image-creation )
41+ - [ Results Screenshots - D03] ( #results-screenshots---d03 )
42+ - [ Github actions workflow Link - D03] ( #github-actions-workflow-link---d03 )
3743- [ Learnings] ( #learnings )
3844- [ Technologies Used]
3945
@@ -68,40 +74,6 @@ Kubernetes manifest files and Helm charts.
6874- Screenshots and explanations of your pipeline in action.
6975Performance metrics, including latency and stress test results.
7076
71- ### Pending for Deployment 1
72- Code
73- - Torchserve not giving proper predictions even for true class, check it - look later - check if its the preprocessing issue - done
74- - Transfer_mar should transfer .pt file and accuracy text file to s3. it can be used for gradio, lambda and accuracy checking - done
75- - Update Workflow to train on pull request and store to s3-dev and compare with prod model accuracy and comment in github actions - done
76- - Update Workflow to train on push request and store to s3-stage for deployment
77- - After stress test move from stage to prod
78- - Comment on the commit with cml for stress test results
79- - in actions change the workflow comment to PR
80- - good one - ecr image May 17, 2025, 00:03:15 (UTC+05.5) - take this and move forward.
81-
82- Note: If we build a gpu image from github runner then its throwing below error, so i have prebuilt image
83-
84- ```
85- /opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
86- warnings.warn("Can't initialize NVML")
87- ```
88- if no thing works then use the model-onnx-server image
89- ### Pending for Deployment 3
90- - lambda
91-
92- todo: (17/05/2025)
93- 1 . add manual screenshot - done
94- 2 . check in github actions once with debug yaml and then add response time graph in load test in github actions
95- 3 . Do model comparision for 01D
96- 3 . do video for github actions 01A and 01B explanation and merger with manual video
97- 4 . add github actions screenshot
98- 5 . add architecture diagram
99-
100-
101- Docs
102- - Architecture diagram
103- - Screenshots of deployment and video
104-
10577Explanation: Deployment 01
10678Architecture diagram
10779dvc setup and pull - dataset
@@ -231,7 +203,7 @@ Make sure `data/processed/sports` and `data/processed/vegfruits` are only prese
231203Set the S3 URL and push
232204- ` dvc push -r myremote `
233205
234- ### Docker command
206+ ### Docker command for train and test
235207
236208** Training with GPU**
237209
@@ -295,13 +267,16 @@ Note: This repo also has the procedure for ArgoCD deployment repo https://github
295267
296269This took our group two months of weekend effort to complete. Debugging and developing on AWS for this deployment alone may cost you at least $20; if you only want to test it with GitHub Actions, it may cost $0.05 to $1.
297269
270+
298271### Architecture Diagram
299272
300- - TODO
273+ 
274+
275+ 
301276
302277### Manual Setup
303278
304- ### Installation
279+ ### D01 Installation
305280
306281**AWS install**
307282
@@ -713,7 +688,28 @@ Load testing Sports model
713688
714689- `python3 utils/test_load_sports_2.py`
715690
716- ### D01 - Result Screenshots
691+ ### Deletion Procedure
692+
693+ **Delete argocd deployments**
694+
695+ Verify app name
696+ - `kubectl get app -n argocd`
697+
698+ Delete cascade
699+ - `kubectl patch app model-deployments -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd`
700+ - `kubectl delete app model-deployments -n argocd`
701+
702+ **Deletion of cluster**
703+
704+ - `eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction`
705+
706+ **Wait patiently and verify on the AWS CloudFormation stacks page that every deletion succeeded before closing your system, because
707+ deletion sometimes fails and leaves resources running in the background, which can be costly**
708+
709+ **If you launched a spot instance manually with the `persistent` request type, make sure that the spot request is cancelled manually
710+ and that the AWS instance is terminated afterwards**
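
The deletion steps above could be collected into one teardown script. This is a minimal sketch, not part of the repository: the `run` and `teardown` helpers and the `DRY_RUN` variable are hypothetical names introduced here, and the two final `aws` checks are an assumed way to look for leftover CloudFormation stacks and spot requests from the command line instead of the console. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Teardown sketch for the D01 manual setup (hypothetical helper script).
set -euo pipefail

# Assumption: DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

teardown() {
  # Verify the ArgoCD app name, then delete it with cascade via the finalizer.
  run kubectl get app -n argocd
  run kubectl patch app model-deployments -n argocd --type merge \
    -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}'
  run kubectl delete app model-deployments -n argocd

  # Delete the EKS cluster described in eks-cluster.yaml.
  run eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction

  # Assumed follow-up checks: stacks still deleting or failed, and live spot requests.
  run aws cloudformation list-stacks --stack-status-filter DELETE_IN_PROGRESS DELETE_FAILED
  run aws ec2 describe-spot-instance-requests --filters Name=state,Values=open,active
}

teardown
```

Running it with `DRY_RUN=0` executes the commands for real. Any stack or spot request reported by the last two commands still has to be cleaned up by hand.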
711+
712+ ### Results Screenshots - D01 Manual
717713
718714**Ports establishment**
719715
@@ -804,29 +800,9 @@ Vegfruits Model
804800
805801
806802
807- ### Deletion Procedure
808803
809804
810- **Delete argocd deployments**
811-
812- Verify app name
813- - `kubectl get app -n argocd`
814-
815- Delete cascade
816- - `kubectl patch app model-deployments -p '{"metadata": {"finalizers": ["resources-finalizer.argocd.argoproj.io"]}}' --type merge -n argocd`
817- - `kubectl delete app model-deployments -n argocd`
818-
819- **Deletion of cluster**
820-
821- - `eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction`
822-
823- **Wait paitently see all deletion is successfull in aws cloud formation stack page and then close the system because some times
824- the deletion gets failed so at backend something would be running and it may cost you high**
825-
826- **If you are triggering a spot instance manually with `peresistent` type ensure that both the spot request is cancelled manually
827- and the AWS instance is terminated finally**
828-
829- ### D01 - Github Actions
805+ ### Github actions - Automaing Manual setup End to End - D01
830806
831807Two Workflows
832808
@@ -873,12 +849,13 @@ There are 3 repos
873849
874850 
875851
852+ ### Results Screenshots - 01A and 01B Deployment Main
876853
877854**Main Deployment workflow**
878855
8798561. On push to main branch triggered
880857
881- 2. Train and push to stage - (screenshot todo)
858+ 2. Train and push to stage
882859
8838603. Sets up ArgoCD and deploys two models with HPA
884861
@@ -904,7 +881,13 @@ There are 3 repos
904881
9058825. Comment after load test results in tested commit - (screenshot)
906883
907- 
884+ 
885+
886+ ### Github actions workflow Link - D01A-D01B
887+
888+ - [D01 - Kubernetes - Main - Final workflow Result](https://github.com/ajithvcoder/E2EMLOps/actions/runs/15092736500)
889+
890+ ### Results Screenshots - D01C On PR Request
908891
909892**PR Accuracy Test workflow**
910893
@@ -930,9 +913,19 @@ There are 3 repos
930913
931914 
932915
933- 6. On PR Request generates a comment (screenshot - todo - change)
916+ 6. Generates a comment on the pull request
917+
918+ 
919+
920+ 
921+
922+ ### Github actions workflow Link - D01C
923+
924+ - [D01C - Kubernetes - On PR - Final workflow Result](https://github.com/ajithvcoder/E2EMLOps/actions/runs/15092490112)
925+
926+ ### D01 - Deployment - Video Link
934927
935- 
928+ - todo
936929
937930## Deployment 03 - AWS Lambda
938931
@@ -995,7 +988,7 @@ Kindly check "AWS Cloud formations" to verify that everything is deleted
995988 </debug>
996989 ```
997990
998- #### D03 - Results Screenshots
991+ ### Results Screenshots - D03
999992
1000993**Github actions deployment**
1001994
@@ -1012,12 +1005,23 @@ CDK package is used to push the image to ECR and lmabda service
10121005
10131006 
10141007
1008+ ### Github actions workflow Link - D03
1009+
1010+ - [D03 - Lambda - Final run workflow link](https://github.com/ajithvcoder/E2EMLOps/actions/runs/15085943040)
1011+
10151012
10161013### Learnings
10171014
101810151. Learnt how to build automated deployments with GitHub Actions
10191016
1020- *Mostly other stuffs are already done in previous assignments and its combination of everything
1017+ 2. Building a GPU image on a GitHub runner produces the warning below, so I used a prebuilt image instead
1018+
1019+ ```
1020+ /opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:734: UserWarning: Can't initialize NVML
1021+ warnings.warn("Can't initialize NVML")
1022+ ```
1023+
1024+ *Most of the other pieces were already done in previous assignments; this assignment is a combination of everything.
10211025
10221026## Technologies Used
10231027