# Prepare AWS DLAMI (Deep Learning AMI) for AWS ParallelCluster using `pcluster build-image`

This document shows how to use `pcluster build-image` to prepare a
[DLAMI](https://aws.amazon.com/machine-learning/amis/) for ParallelCluster.
| 5 | + |
| 6 | +Choose this approach when you have these requirements: |
| 7 | + |
| 8 | +1. Build AMI using native AWS tools only. This approach uses the `pcluster` cli to launch [EC2 Image |
| 9 | + Builder](https://aws.amazon.com/image-builder/) jobs. No dependency to community toolkits. |
| 10 | + |
| 11 | +2. To use DLAMI which already comes prebuilt with deep-learning stack optimized for AWS: |
| 12 | + [EFA](https://aws.amazon.com/hpc/efa/), |
| 13 | + [Docker](https://www.docker.com/products/container-runtime/), GPU stack |
| 14 | + ([CUDA](https://developer.nvidia.com/cuda-toolkit), [cuDNN](https://developer.nvidia.com/cudnn), |
| 15 | + [nccl](https://github.com/NVIDIA/nccl), [aws-ofi-nccl](https://github.com/aws/aws-ofi-nccl), |
| 16 | + [gdrcopy](https://github.com/NVIDIA/gdrcopy), |
| 17 | + [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit), |
| 18 | + [nccl-tests](https://github.com/NVIDIA/nccl-tests)) or [Neuron |
| 19 | + SDK](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/), and frameworks (e.g., |
| 20 | + [PyTorch](https://pytorch.org/)). |
| 21 | + |
| 22 | +## 1. Install PCluster CLI |
| 23 | + |
| 24 | +On your client machine (e.g., laptop) pre-configured with [AWS CLI](https://aws.amazon.com/cli/) and |
| 25 | +AWS credential, install PCluster CLI following this |
| 26 | +[documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/install-v3-virtual-environment.html). |
| 27 | + |
| 28 | +Below example uses a Python virtual environment: |
| 29 | + |
| 30 | +```bash |
| 31 | +# Create and activate virtualenv for pcluster cli |
| 32 | +python3 /usr/bin/python3 -m venv .venv |
| 33 | +source .venv/bin/activate |
| 34 | +which pip |
| 35 | +# Eyeball we're using pip binary from the venv. |
| 36 | + |
| 37 | +# Install pcluster to virtualenv |
| 38 | +pip install --upgrade pip setuptools |
| 39 | +pip install aws-parallelcluster |
| 40 | +pcluster version |
| 41 | +``` |
| 42 | + |
| 43 | +Here's another example to install `pcluster` to a `conda` environment: |
| 44 | + |
| 45 | +```bash |
| 46 | +conda env create pc380-p312 python=3.12 # Replace pc and python versions as you like. |
| 47 | +conda activate pc380-p312 |
| 48 | +conda install -c conda-forge aws-parallelcluster |
| 49 | +pcluster version |
| 50 | +``` |
| 51 | + |
| 52 | +## 2. Build AMI |
| 53 | + |
| 54 | +We provide two build-specification examples: one for DLAMI Base, and another for DLAMI PyTorch; both |
| 55 | +with Ubuntu-20.04 flavor. You're strongly recommended to review the `.yaml` files, adjust as needed |
| 56 | +(e.g., use the latest DLAMI as parent), and only then proceed to build the custom AMI. |
| 57 | + |
| 58 | +```bash |
| 59 | +export AWS_REGION=us-west-2 |
| 60 | + |
| 61 | +# Customize DLAMI Base |
| 62 | +pcluster build-image -r $AWS_REGION -c 01.dlami-ub2004-base-gpu.yaml -i pc-dlami-base |
| 63 | + |
| 64 | +# Customize DLAMI PyTorch |
| 65 | +pcluster build-image -r $AWS_REGION -c 02.dlami-ub2004-pytorch-gpu.yaml -i pc-dlami-pytorch |
| 66 | +``` |
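The build runs unattended and can take a while. As a convenience for scripting, below is a minimal sketch of a polling helper; `wait_for_image` is a hypothetical name, the sketch assumes `jq` is installed, and it assumes `imageBuildStatus` is a top-level field of the `pcluster describe-image` output (as in ParallelCluster 3.x) — verify against your version:

```shell
# wait_for_image: poll the image build status until it leaves BUILD_IN_PROGRESS.
# Usage: wait_for_image <image-id> <region>
wait_for_image() {
  local image_id=$1 region=$2 status
  while true; do
    # Assumption: describe-image returns JSON with a top-level imageBuildStatus field.
    status=$(pcluster describe-image -i "$image_id" -r "$region" \
             | jq -r '.imageBuildStatus')
    echo "imageBuildStatus=$status"
    [ "$status" != "BUILD_IN_PROGRESS" ] && break
    sleep 60
  done
}
```

For example, `wait_for_image pc-dlami-base us-west-2` prints the status once a minute and returns when the build finishes (`BUILD_COMPLETE`) or fails (`BUILD_FAILED`).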
| 67 | + |
| 68 | +Each `pcluster build-image` command displays the results in JSON, e.g., |
| 69 | + |
| 70 | +```json |
| 71 | +{ |
| 72 | + "image": { |
| 73 | + "imageId": "pc-dlami-base", |
| 74 | + "imageBuildStatus": "BUILD_IN_PROGRESS", |
| 75 | + "cloudformationStackStatus": "CREATE_IN_PROGRESS", |
| 76 | + "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:111122223333:stack/pc-dlami-base/097178b0-3037-11ee-97c3-0672f191cc71", |
| 77 | + "region": "us-west-2", |
| 78 | + "version": "3.8.0" |
| 79 | + } |
| 80 | +} |
| 81 | +``` |
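If you capture this response in a file, the individual fields are easy to pull out with `jq`. A small sketch (the file name `build.json` and the sample values are for illustration only):

```shell
# Write a sample build-image response to a file, then query it with jq.
cat > build.json <<'EOF'
{
  "image": {
    "imageId": "pc-dlami-base",
    "imageBuildStatus": "BUILD_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:111122223333:stack/pc-dlami-base/097178b0-3037-11ee-97c3-0672f191cc71"
  }
}
EOF

# Current build status
jq -r '.image.imageBuildStatus' build.json                        # BUILD_IN_PROGRESS
# Stack name, i.e., the second path segment of the stack ARN
jq -r '.image.cloudformationStackArn | split("/")[1]' build.json  # pc-dlami-base
```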
| 82 | + |
| 83 | +<details> |
| 84 | + <summary><b>Pro-tips</b>: syntax-colored <code>pcluster</code> output</summary> |
| 85 | + |
| 86 | + To syntax-color the `pcluster`'s JSON output, pipe the command to `jq`. Examples below. |
| 87 | + |
| 88 | + ```bash |
| 89 | + export AWS_REGION=us-west-2 |
| 90 | + |
| 91 | + # Customize DLAMI Base |
| 92 | + pcluster build-image -r $AWS_REGION -c 01.dlami-ub2004-base-gpu.yaml -i pc-dlami-ubuntu-base-gpu | jq . |
| 93 | + |
| 94 | + # Customize DLAMI PyTorch |
| 95 | + pcluster build-image -r $AWS_REGION -c 02.dlami-ub2004-pytorch-gpu.yaml -i pc-dlami-ubuntu-base-pytorch | jq . |
| 96 | + ``` |
| 97 | + |
| 98 | +</details> |
| 99 | + |
| 100 | +While the image is building, you should see a CloudFormation stack with the same name as the AMI |
| 101 | +name (e.g., `pc-dlami-ubuntu-base-gpu` for the first example above). From this CloudFormation stack, |
| 102 | +you can trace the AWS resources involved in the build process, such as the EC2 instance, the Image |
| 103 | +Builder pipeline, etc. |
| 104 | + |
| 105 | +You can also check the build log in CloudWatch. Look for log group |
| 106 | +`/aws/imagebuilder/ParallelClusterImage-<AMI_NAME>` and log stream `<PCLUSTER_VERSION>/1`, e.g., for |
| 107 | +the first example above are `/aws/imagebuilder/ParallelClusterImage-pc-dlami-ubuntu-base-pytorch` |
| 108 | +and `3.8.0/1`. |
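Since both names are derived mechanically from the AMI name and the `pcluster` version, a short shell sketch can compute them for you (the naming pattern is assumed to be the one shown above):

```shell
# Derive the CloudWatch log group and stream names for a build-image run.
AMI_NAME=pc-dlami-base     # the value passed to `pcluster build-image -i`
PCLUSTER_VERSION=3.8.0     # as reported by `pcluster version`
LOG_GROUP="/aws/imagebuilder/ParallelClusterImage-${AMI_NAME}"
LOG_STREAM="${PCLUSTER_VERSION}/1"
echo "${LOG_GROUP} ${LOG_STREAM}"
# -> /aws/imagebuilder/ParallelClusterImage-pc-dlami-base 3.8.0/1
```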
| 109 | + |
| 110 | +<details> |
| 111 | + <summary><b>Pro-tips</b>: fetch build logs using community cli <code>awslogs</code></summary> |
| 112 | + |
| 113 | + Below are examples to use the community cli `awslogs` to fetch from CloudWatch the build log. To |
| 114 | + install `awslogs`, please follow its [installation |
| 115 | + instructions](https://github.com/jorgebastida/awslogs#installation). |
| 116 | + |
| 117 | + Below example assumes ami named `pc-dlami-base` and `pcluster` version 3.8.0. Please update the |
| 118 | + log group and stream names accordingly. When in doubt, check the log group and stream names from |
| 119 | + the CloudWatch console. |
| 120 | + |
| 121 | + ```bash |
| 122 | + # Watch the build-image process of ami name `pc-dlami-base`. |
| 123 | + awslogs get -GS --aws-region=us-west-2 \ |
| 124 | + /aws/imagebuilder/ParallelClusterImage-pc-dlami-base 3.8.0/1 --watch -i 30 -s10min |
| 125 | + |
| 126 | + # Save all logs to a local file. Will also pull the failed logs from the earlier attempt. |
| 127 | + # |
| 128 | + # -s4d instructs the cli tool to fetch logs from the last 4d. Without this flags, it fecthes only |
| 129 | + # a few entries, or even none at all. |
| 130 | + awslogs get -GS --aws-region=us-west-2 \ |
| 131 | + /aws/imagebuilder/ParallelClusterImage-pc-dlami-base 3.8.0/1 -s4d &> build-image-01-success.log |
| 132 | + ``` |
| 133 | + |
| 134 | +</details> |
| 135 | + |
| 136 | +## 3. Appendix: advance topics |
| 137 | + |
| 138 | +### 3.1. Update OS packages |
| 139 | + |
| 140 | +Not recommended for DLAMI. Occasionally the build may fail. This happens when the Lustre client for |
| 141 | +the new kernel is not yet released *at AMI build time*. |
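For reference, OS package updates during the build are controlled by the `UpdateOsPackages` setting of the build-image configuration. A sketch of the relevant fragment (the parent image id is a placeholder; verify the key names against your ParallelCluster version's documentation):

```yaml
Build:
  InstanceType: g4dn.xlarge
  ParentImage: ami-0123456789abcdef0  # placeholder: your DLAMI id
  UpdateOsPackages:
    Enabled: false  # keep the DLAMI kernel so the preinstalled Lustre client still matches
```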
| 142 | + |
### 3.2. Build-image cookbook

`pcluster build-image` preserves the pre-built packages in the parent DLAMI.

1. The NVIDIA driver and CUDA are [never re-installed][pcbi-skip-nvidia]. To force a re-install, you
   have to enable them with a [dev setting][pcbi-dev-setting].

2. The EFA stack [won't be re-installed][pcbi-skip-efa] if it is already installed.
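As an illustration, forcing the NVIDIA recipes back on via extra Chef attributes might look like the fragment below. This is a sketch only: the attribute path (`cluster.nvidia.enabled`) is an assumption based on the cookbook source linked in this section, so verify it against your ParallelCluster version before relying on it.

```yaml
Build:
  InstanceType: g4dn.xlarge
  ParentImage: ami-0123456789abcdef0  # placeholder: your DLAMI id
DevSettings:
  Cookbook:
    # Assumed attribute path; check the cookbook for your pcluster version.
    ExtraChefAttributes: |
      {"cluster": {"nvidia": {"enabled": "yes"}}}
```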

<!-- Below are permalinks to the `develop` branch -->
[pcbi-skip-nvidia]:
  <https://github.com/aws/aws-parallelcluster-cookbook/blob/79458c1926ab71bb54d676d93fe975041cf46f75/cookbooks/aws-parallelcluster-platform/resources/nvidia_driver/partial/_nvidia_driver_common.rb#L23>
[pcbi-dev-setting]:
  <https://github.com/aws/aws-parallelcluster-cookbook/blob/79458c1926ab71bb54d676d93fe975041cf46f75/cookbooks/aws-parallelcluster-platform/libraries/nvidia.rb#L2>
[pcbi-skip-efa]:
  <https://github.com/aws/aws-parallelcluster-cookbook/blob/79458c1926ab71bb54d676d93fe975041cf46f75/cookbooks/aws-parallelcluster-environment/resources/efa/partial/_common.rb#L24>