Commit 2abc6bd

Improvements/#371 change nccl version (#372)
* Change nccl and efa installer in micro bench. Close #371
* Fix memory size for nccl test on k8s. Close #312
1 parent 33b6335 commit 2abc6bd

4 files changed, with 16 additions and 16 deletions

micro-benchmarks/nccl-tests/README.md

Lines changed: 9 additions & 9 deletions
@@ -36,17 +36,17 @@ The NCCL tests are packaged in a container.
 > | Variable | Default | Repository |
 > |-----------------------|-------------|---------------------------------------------------------------------------------------------|
 > |`GDRCOPY_VERSION` | `v2.4.1` | [link](https://github.com/NVIDIA/gdrcopy) |
-> |`EFA_INSTALLER_VERSION`| `1.31.0` | [link](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable) |
-> |`AWS_OFI_NCCL_VERSION` | `v1.8.1-aws`| [link](https://github.com/aws/aws-ofi-nccl) |
-> |`NCCL_VERSION` | `v2.20.3-1` | [link](https://github.com/NVIDIA/nccl) |
+> |`EFA_INSTALLER_VERSION`| `1.33.0` | [link](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable) |
+> |`AWS_OFI_NCCL_VERSION` | `v1.9.2-aws`| [link](https://github.com/aws/aws-ofi-nccl) |
+> |`NCCL_VERSION` | `v2.21.5-1` | [link](https://github.com/NVIDIA/nccl) |
 > |`NCCL_TESTS_VERSION` | `v2.13.9` | [link](https://github.com/NVIDIA/nccl-tests) |
 
 ### Build the container
 1. Build the container image with the command below:
 ```bash
-EFA_INSTALLER_VERSION=1.31.0
-AWS_OFI_NCCL_VERSION=v1.8.1-aws
-NCCL_VERSION=v2.20.3-1
+EFA_INSTALLER_VERSION=1.33.0
+AWS_OFI_NCCL_VERSION=v1.9.2-aws
+NCCL_VERSION=v2.21.5-1
 NCCL_TESTS_VERSION=v2.13.9
 docker build -f nccl-tests.Dockerfile \
 --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
@@ -81,9 +81,9 @@ To run the NCCL tests on EKS, you will need to build the container image, then p
 
 1. Create the ECR repository if it does not exist
 ```bash
-EFA_INSTALLER_VERSION=1.31.0
-AWS_OFI_NCCL_VERSION=v1.8.1-aws
-NCCL_VERSION=v2.20.3-1
+EFA_INSTALLER_VERSION=1.33.0
+AWS_OFI_NCCL_VERSION=v1.9.2-aws
+NCCL_VERSION=v2.21.5-1
 NCCL_TESTS_VERSION=v2.13.9
 ECR_REPOSITORY_NAME="nccl-tests"
 TAG="${EFA_INSTALLER_VERSION}-${AWS_OFI_NCCL_VERSION}-${NCCL_VERSION}-${NCCL_TESTS_VERSION}"
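
For reference, the image tag that the README snippet composes from the updated defaults can be checked with a short sketch like the one below. It only echoes the resulting string, assuming the four version values shown in this commit, and does not build or push anything.

```bash
#!/usr/bin/env bash
# Version values taken from the updated defaults in this commit.
EFA_INSTALLER_VERSION=1.33.0
AWS_OFI_NCCL_VERSION=v1.9.2-aws
NCCL_VERSION=v2.21.5-1
NCCL_TESTS_VERSION=v2.13.9

# Same composition as the README's TAG line.
TAG="${EFA_INSTALLER_VERSION}-${AWS_OFI_NCCL_VERSION}-${NCCL_VERSION}-${NCCL_TESTS_VERSION}"
echo "${TAG}"
# prints: 1.33.0-v1.9.2-aws-v2.21.5-1-v2.13.9
```

Because the tag embeds every version, an image built with the previous defaults should end up under a different tag rather than being overwritten.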

micro-benchmarks/nccl-tests/kubernetes/nccl-tests.yaml

Lines changed: 2 additions & 2 deletions
@@ -88,12 +88,12 @@ spec:
     nvidia.com/gpu: 8
     hugepages-2Mi: 5120Mi
     vpc.amazonaws.com/efa: 32
-    memory: 8000Mi
+    memory: 32000Mi
   requests:
     nvidia.com/gpu: 8
     hugepages-2Mi: 5120Mi
     vpc.amazonaws.com/efa: 32
-    memory: 8000Mi
+    memory: 32000Mi
 volumes:
 - name: shmem
   hostPath:

micro-benchmarks/nccl-tests/nccl-tests.Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -3,9 +3,9 @@
 FROM nvidia/cuda:12.2.2-devel-ubuntu22.04
 
 ARG GDRCOPY_VERSION=v2.4.1
-ARG EFA_INSTALLER_VERSION=1.31.0
-ARG AWS_OFI_NCCL_VERSION=v1.8.1-aws
-ARG NCCL_VERSION=v2.20.3-1
+ARG EFA_INSTALLER_VERSION=1.33.0
+ARG AWS_OFI_NCCL_VERSION=v1.9.2-aws
+ARG NCCL_VERSION=v2.21.5-1
 ARG NCCL_TESTS_VERSION=v2.13.9
 
 RUN apt-get update -y && apt-get upgrade -y

micro-benchmarks/nccl-tests/slurm/nccl-tests-ami.sbatch

Lines changed: 2 additions & 2 deletions
@@ -13,8 +13,8 @@ set -ex
 
 # This script is designed to run by default on the Deep Learning AMI, Ubuntu 20.04
 # See https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/
-ALL_REDUCE_BINARY=${1:-/usr/local/cuda-12.2/efa/test-cuda-12.2/all_reduce_perf}
-ADDITIONAL_LD_LIBRARY_PATH=${2:-}
+ALL_REDUCE_BINARY=${1:-/usr/local/cuda-12.3/efa/test-cuda-12.3/all_reduce_perf}
+ADDITIONAL_LD_LIBRARY_PATH=${2:-/usr/local/cuda-12.3/lib}
 
 # Get Hostname to Instance ID mapping
 mpirun -N 1 bash -c 'echo $(hostname) ➡️ $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
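
The two changed assignments rely on the shell's `${N:-default}` expansion: a positional argument, when given and non-empty, overrides the new CUDA 12.3 default. A minimal sketch of that behavior follows; the helper name `resolve_paths` and the override path `/opt/custom/all_reduce_perf` are hypothetical and used only for illustration.

```bash
#!/usr/bin/env bash
# resolve_paths is a hypothetical helper mirroring the script's two assignments.
resolve_paths() {
  local all_reduce_binary=${1:-/usr/local/cuda-12.3/efa/test-cuda-12.3/all_reduce_perf}
  local additional_ld_library_path=${2:-/usr/local/cuda-12.3/lib}
  echo "${all_reduce_binary} ${additional_ld_library_path}"
}

# No arguments: both defaults apply.
DEFAULTS=$(resolve_paths)
# One argument: only the binary path is overridden (hypothetical path).
OVERRIDDEN=$(resolve_paths /opt/custom/all_reduce_perf)

echo "${DEFAULTS}"
echo "${OVERRIDDEN}"
```

Under Slurm, arguments placed after the script name on the `sbatch` command line are passed to the script, so the first one would play the role of `$1` here.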
