Commit 080aff2

verdimrc, Verdi March, and enrico-usai authored
Prepare DLAMI for ParallelCluster using pcluster build-image (#92)
* Prepare DLAMI using pcluster build-image
* Update readme
* Rename build spec *.{config-dlami => dlami}-*.yaml
* Link to DLAMI release notes; remove bogus tag
* Document compatibility between pcluster build-image and DLAMI

  Co-authored-by: enrico-usai <10634438+enrico-usai@users.noreply.github.com>
* Simplify .yaml files

---------

Co-authored-by: Verdi March <marcverd@amazon.com>
Co-authored-by: enrico-usai <10634438+enrico-usai@users.noreply.github.com>
1 parent 469efb5 commit 080aff2

4 files changed: 226 additions, 0 deletions

.markdownlint.jsonc

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
{
2+
"default": true,
3+
"MD041": false,
4+
"MD013": {
5+
"line_length": 100,
6+
"code_blocks": false
7+
},
8+
"MD033": false
9+
}
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
Build:
2+
# Estimated build time: ~50min
3+
InstanceType: g4dn.4xlarge
4+
5+
# Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240101 / us-west-2
6+
#
7+
# See also: https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html
8+
ParentImage: ami-0a8dada81f29ad054
9+
10+
Image:
11+
RootVolume:
12+
Encrypted: false
13+
14+
# https://docs.aws.amazon.com/parallelcluster/latest/ug/build-Image-v3.html#yaml-build-image-Image-RootVolume-Size
15+
# Size (Optional, Integer)
16+
# Specifies the size of the root volume for the image, in GiB. The default size is the size of
17+
# the ParentImage plus 27 GiB.
18+
#
19+
# The value below matches the parent AMI's root volume size.
20+
Size: 75
21+
22+
# Optional tags for convenience when viewing on AWS console.
23+
Tags:
24+
#- Key: Name
25+
# Value: "Value to show in Name column, AMI list, EC2 console"
26+
- Key: base_ami_id
27+
Value: ami-0a8dada81f29ad054
28+
- Key: base_ami_name
29+
Value: "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04) 20240101"
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
Build:
2+
# Estimated build time: ~50min
3+
InstanceType: g4dn.4xlarge
4+
5+
# Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20240102 / us-west-2
6+
#
7+
# See also: https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html
8+
ParentImage: ami-042f0ea5a85c8407a
9+
10+
Image:
11+
RootVolume:
12+
Encrypted: false
13+
14+
# https://docs.aws.amazon.com/parallelcluster/latest/ug/build-Image-v3.html#yaml-build-image-Image-RootVolume-Size
15+
# Size (Optional, Integer)
16+
# Specifies the size of the root volume for the image, in GiB. The default size is the size of
17+
# the ParentImage plus 27 GiB.
18+
#
19+
# The value below matches the parent AMI's root volume size.
20+
Size: 45
21+
22+
# Optional tags for convenience when viewing on AWS console.
23+
Tags:
24+
#- Key: Name
25+
# Value: "Value to show in Name column, AMI list, EC2 console"
26+
- Key: base_ami_id
27+
Value: ami-042f0ea5a85c8407a
28+
- Key: base_ami_name
29+
Value: "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20240102"
Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
1+
# Prepare AWS DLAMI (Deep Learning AMI) for AWS ParallelCluster using `pcluster build-image`
2+
3+
This document shows how to use `pcluster build-image` to prepare
4+
[DLAMI](https://aws.amazon.com/machine-learning/amis/) for ParallelCluster.
5+
6+
Choose this approach when you have these requirements:
7+
8+
1. Build the AMI using native AWS tools only. This approach uses the `pcluster` CLI to launch [EC2 Image
9+
Builder](https://aws.amazon.com/image-builder/) jobs, with no dependency on community toolkits.
10+
11+
2. Use a DLAMI that already comes prebuilt with a deep-learning stack optimized for AWS:
12+
[EFA](https://aws.amazon.com/hpc/efa/),
13+
[Docker](https://www.docker.com/products/container-runtime/), GPU stack
14+
([CUDA](https://developer.nvidia.com/cuda-toolkit), [cuDNN](https://developer.nvidia.com/cudnn),
15+
[nccl](https://github.com/NVIDIA/nccl), [aws-ofi-nccl](https://github.com/aws/aws-ofi-nccl),
16+
[gdrcopy](https://github.com/NVIDIA/gdrcopy),
17+
[nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit),
18+
[nccl-tests](https://github.com/NVIDIA/nccl-tests)) or [Neuron
19+
SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/), and frameworks (e.g.,
20+
[PyTorch](https://pytorch.org/)).
21+
22+
## 1. Install PCluster CLI
23+
24+
On your client machine (e.g., laptop) pre-configured with [AWS CLI](https://aws.amazon.com/cli/) and
25+
AWS credentials, install the PCluster CLI by following this
26+
[documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html).
27+
28+
The example below uses a Python virtual environment:
29+
30+
```bash
31+
# Create and activate virtualenv for pcluster cli
32+
/usr/bin/python3 -m venv .venv
33+
source .venv/bin/activate
34+
which pip
35+
# Check by eye that this is the pip binary from the venv.
36+
37+
# Install pcluster to virtualenv
38+
pip install --upgrade pip setuptools
39+
pip install aws-parallelcluster
40+
pcluster version
41+
```
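The `which pip` check above can also be scripted. Below is a minimal sketch, assuming the venv has been activated so that `VIRTUAL_ENV` is set. The helper name `is_under_venv` is hypothetical and not part of `pcluster`.

```bash
# Hypothetical helper: succeed only when a binary lives under <venv>/bin.
is_under_venv() {
  # $1: path to a binary, $2: venv root directory
  [ -n "$2" ] || return 1
  case "$1" in
    "$2"/bin/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, after `source .venv/bin/activate`:
#   is_under_venv "$(command -v pip)" "$VIRTUAL_ENV" || echo "pip is not from the venv" >&2
```

The check is a plain path-prefix comparison, so it does not detect a venv that was activated and then moved or deleted.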
42+
43+
Here's another example that installs `pcluster` into a `conda` environment:
44+
45+
```bash
46+
conda create -n pc380-p312 python=3.12  # Replace the pcluster and Python versions as you like.
47+
conda activate pc380-p312
48+
conda install -c conda-forge aws-parallelcluster
49+
pcluster version
50+
```
51+
52+
## 2. Build AMI
53+
54+
We provide two build-specification examples: one for DLAMI Base, and another for DLAMI PyTorch; both
55+
with the Ubuntu 20.04 flavor. We strongly recommend that you review the `.yaml` files, adjust them as needed
56+
(e.g., use the latest DLAMI as parent), and only then proceed to build the custom AMI.
57+
58+
```bash
59+
export AWS_REGION=us-west-2
60+
61+
# Customize DLAMI Base
62+
pcluster build-image -r $AWS_REGION -c 01.dlami-ub2004-base-gpu.yaml -i pc-dlami-base
63+
64+
# Customize DLAMI PyTorch
65+
pcluster build-image -r $AWS_REGION -c 02.dlami-ub2004-pytorch-gpu.yaml -i pc-dlami-pytorch
66+
```
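To use the latest DLAMI as the parent, you can look up its AMI ID with the AWS CLI before editing `ParentImage`. Below is a minimal sketch, assuming DLAMI names end with a space and an eight-digit date, as in the build specs. The helper names `dlami_name_filter` and `latest_dlami_id` are hypothetical.

```bash
# Hypothetical helper: build the name filter for a DLAMI family.
dlami_name_filter() {
  # $1: AMI name without the trailing YYYYMMDD date
  printf 'Name=name,Values=%s ????????\n' "$1"
}

# Hypothetical helper: print the newest available AMI ID of that family.
latest_dlami_id() {
  # $1: region, $2: AMI name without the trailing date
  aws ec2 describe-images --region "$1" --owners amazon \
    --filters "$(dlami_name_filter "$2")" 'Name=state,Values=available' \
    --query 'reverse(sort_by(Images, &CreationDate))[0].ImageId' \
    --output text
}

# Usage:
#   latest_dlami_id us-west-2 "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 20.04)"
```

Remember to update the `base_ami_id` and `base_ami_name` tags and, if the parent's root volume size changed, `Size` as well.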
67+
68+
Each `pcluster build-image` command prints its result as JSON, e.g.,
69+
70+
```json
71+
{
72+
"image": {
73+
"imageId": "pc-dlami-base",
74+
"imageBuildStatus": "BUILD_IN_PROGRESS",
75+
"cloudformationStackStatus": "CREATE_IN_PROGRESS",
76+
"cloudformationStackArn": "arn:aws:cloudformation:us-west-2:111122223333:stack/pc-dlami-base/097178b0-3037-11ee-97c3-0672f191cc71",
77+
"region": "us-west-2",
78+
"version": "3.8.0"
79+
}
80+
}
81+
```
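The `imageBuildStatus` field can be polled until the build finishes. Below is a minimal sketch, assuming that `pcluster describe-image` reports the same `imageBuildStatus` field and that a successful build ends in `BUILD_COMPLETE`. Please verify both against your `pcluster` version. The helper names `build_status` and `wait_for_image` are hypothetical.

```bash
# Hypothetical helper: read pcluster JSON on stdin, print the imageBuildStatus value.
build_status() {
  sed -n 's/.*"imageBuildStatus"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: poll until the build leaves BUILD_IN_PROGRESS.
wait_for_image() {
  # $1: image id, $2: region, $3: polling interval in seconds (default 60)
  while :; do
    status=$(pcluster describe-image -i "$1" -r "$2" | build_status)
    echo "$(date +%T) $1: ${status:-unknown}"
    case "$status" in
      BUILD_IN_PROGRESS) sleep "${3:-60}" ;;
      BUILD_COMPLETE) return 0 ;;
      *) return 1 ;;  # failed, deleted, or unreadable status
    esac
  done
}

# Usage:
#   wait_for_image pc-dlami-base "$AWS_REGION" && echo "AMI is ready"
```

The status is extracted with `sed` so the sketch has no dependency beyond a POSIX shell. If `jq` is installed, `jq -r .imageBuildStatus` is a more robust alternative.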
82+
83+
<details>
84+
<summary><b>Pro-tips</b>: syntax-colored <code>pcluster</code> output</summary>
85+
86+
To syntax-color `pcluster`'s JSON output, pipe it to `jq`, as in the examples below.
87+
88+
```bash
89+
export AWS_REGION=us-west-2
90+
91+
# Customize DLAMI Base
92+
pcluster build-image -r $AWS_REGION -c 01.dlami-ub2004-base-gpu.yaml -i pc-dlami-ubuntu-base-gpu | jq .
93+
94+
# Customize DLAMI PyTorch
95+
pcluster build-image -r $AWS_REGION -c 02.dlami-ub2004-pytorch-gpu.yaml -i pc-dlami-ubuntu-base-pytorch | jq .
96+
```
97+
98+
</details>
99+
100+
While the image is building, you should see a CloudFormation stack with the same name as the AMI
101+
name (e.g., `pc-dlami-ubuntu-base-gpu` for the first example above). From this CloudFormation stack,
102+
you can trace the AWS resources involved in the build process, such as the EC2 instance, the Image
103+
Builder pipeline, etc.
104+
105+
You can also check the build log in CloudWatch. Look for log group
106+
`/aws/imagebuilder/ParallelClusterImage-<AMI_NAME>` and log stream `<PCLUSTER_VERSION>/1`, e.g., for
107+
the second example above, these are `/aws/imagebuilder/ParallelClusterImage-pc-dlami-ubuntu-base-pytorch`
108+
and `3.8.0/1`.
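The naming convention above can be captured in two small helpers. Below is a minimal sketch that also follows the log with `aws logs tail`, assuming AWS CLI v2 is installed. The helper names `log_group_for`, `log_stream_for`, and `tail_build_log` are hypothetical.

```bash
# Hypothetical helpers: derive the CloudWatch names from the AMI name and pcluster version.
log_group_for() { printf '/aws/imagebuilder/ParallelClusterImage-%s\n' "$1"; }
log_stream_for() { printf '%s/1\n' "$1"; }

# Hypothetical helper: follow the build log with the AWS CLI (v2 assumed).
tail_build_log() {
  # $1: AMI name, $2: pcluster version, $3: region
  aws logs tail "$(log_group_for "$1")" \
    --log-stream-names "$(log_stream_for "$2")" \
    --region "$3" --since 1h --follow
}

# Usage:
#   tail_build_log pc-dlami-base 3.8.0 us-west-2
```

When in doubt about the stream suffix, check the actual log stream name in the CloudWatch console.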
109+
110+
<details>
111+
<summary><b>Pro-tips</b>: fetch build logs using community cli <code>awslogs</code></summary>
112+
113+
The examples below use the community CLI `awslogs` to fetch the build log from CloudWatch. To
114+
install `awslogs`, please follow its [installation
115+
instructions](https://github.com/jorgebastida/awslogs#installation).
116+
117+
The example below assumes an AMI named `pc-dlami-base` and `pcluster` version 3.8.0. Please update the
118+
log group and stream names accordingly. When in doubt, check the log group and stream names from
119+
the CloudWatch console.
120+
121+
```bash
122+
# Watch the build-image process for the AMI named `pc-dlami-base`.
123+
awslogs get -GS --aws-region=us-west-2 \
124+
/aws/imagebuilder/ParallelClusterImage-pc-dlami-base 3.8.0/1 --watch -i 30 -s10min
125+
126+
# Save all logs to a local file. Will also pull the failed logs from the earlier attempt.
127+
#
128+
# -s4d tells the CLI to fetch logs from the last 4 days. Without this flag, it fetches only
129+
# a few entries, or even none at all.
130+
awslogs get -GS --aws-region=us-west-2 \
131+
/aws/imagebuilder/ParallelClusterImage-pc-dlami-base 3.8.0/1 -s4d &> build-image-01-success.log
132+
```
133+
134+
</details>
135+
136+
## 3. Appendix: advanced topics
137+
138+
### 3.1. Update OS packages
139+
140+
Updating OS packages is not recommended for DLAMI because the build may occasionally fail. This happens when the Lustre client for
141+
the new kernel is not yet released *at AMI build time*.
142+
143+
### 3.2. Build-image cookbook
144+
145+
`pcluster build-image` preserves the pre-built packages in the parent DLAMI.
146+
147+
1. The NVIDIA driver and CUDA are [never re-installed][pcbi-skip-nvidia] unless you enable their installation with a
148+
[dev setting][pcbi-dev-setting].
149+
150+
2. The EFA stack [won't be re-installed][pcbi-skip-efa] if it is already installed.
151+
152+
<!-- Below are permalinks to `develop` branch -->
153+
[pcbi-skip-nvidia]:
154+
<https://github.com/aws/aws-parallelcluster-cookbook/blob/79458c1926ab71bb54d676d93fe975041cf46f75/cookbooks/aws-parallelcluster-platform/resources/nvidia_driver/partial/_nvidia_driver_common.rb#L23>
155+
[pcbi-dev-setting]:
156+
<https://github.com/aws/aws-parallelcluster-cookbook/blob/79458c1926ab71bb54d676d93fe975041cf46f75/cookbooks/aws-parallelcluster-platform/libraries/nvidia.rb#L2>
157+
[pcbi-skip-efa]:
158+
<https://github.com/aws/aws-parallelcluster-cookbook/blob/79458c1926ab71bb54d676d93fe975041cf46f75/cookbooks/aws-parallelcluster-environment/resources/efa/partial/_common.rb#L24>
159+
