Commit 999cfc8

Merge pull request #275 from stackhpc/feat/partition-ucx-dev
Allow defining UCX device per partition for hpctests
2 parents 6b702e1 + f26cfe2 commit 999cfc8

3 files changed: 8 additions, 3 deletions

.github/workflows/stackhpc.yml

Lines changed: 2 additions & 2 deletions
@@ -66,13 +66,13 @@ jobs:
           cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
           terraform apply -auto-approve
 
-      - name: Delete infrastructure if failed due to lack of hosts
+      - name: Delete infrastructure if provisioning failed
         run: |
           . venv/bin/activate
           . environments/.stackhpc/activate
           cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
           terraform destroy -auto-approve
-        if: ${{ steps.provision_servers.outcome == 'failure' }}
+        if: failure() && steps.provision_servers.outcome == 'failure'
 
       - name: Configure cluster
         run: |
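
Note on the `if:` change: in GitHub Actions, a step condition that contains no status check function has `success()` applied implicitly, so a cleanup step conditioned only on an earlier step's `outcome` is skipped once the job has already failed. Adding `failure()` removes that implicit check. The following is a minimal sketch of the pattern, not the repository's workflow: the workflow name, trigger, job name and `echo`/`exit` commands are hypothetical stand-ins, and only the `provision_servers` step id and the condition itself come from the diff.

```yaml
# Hypothetical minimal workflow illustrating the condition used in this commit.
name: cleanup-on-failure-sketch
on: workflow_dispatch
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Provision servers
        id: provision_servers
        run: exit 1  # stand-in for a failing `terraform apply`

      - name: Delete infrastructure if provisioning failed
        # failure() replaces the implicit success() check, so this step is
        # still evaluated after an earlier step has failed.
        if: failure() && steps.provision_servers.outcome == 'failure'
        run: echo "terraform destroy would run here"

      - name: Configure cluster
        # No status check function, so success() is implied and the step
        # is skipped once the job has failed.
        run: echo "not reached when provisioning fails"
```

The second half of the condition restricts the cleanup to failures of the provisioning step itself, rather than any earlier failure in the job.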

ansible/roles/hpctests/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,7 @@ Role Variables
 - `hpctests_rootdir`: Required. Path to root of test directory tree, which must be on a r/w filesystem shared to all cluster nodes under test. The last directory component will be created.
 - `hpctests_partition`: Optional. Name of partition to use, otherwise default partition is used.
 - `hpctests_nodes`: Optional. A Slurm node expression, e.g. `'compute-[0-15,19]'` defining the nodes to use. If not set all nodes in the selected partition are used.
-- `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use).
+- `hpctests_ucx_net_devices`: Optional. Control which network device/interface to use, e.g. `mlx5_1:0`. The default of `all` (as per UCX) may not be appropriate for multi-rail nodes with different bandwidths on each device. See [here](https://openucx.readthedocs.io/en/master/faq.html#what-is-the-default-behavior-in-a-multi-rail-environment) and [here](https://github.com/openucx/ucx/wiki/UCX-environment-parameters#setting-the-devices-to-use). Alternatively a mapping of partition name (as `hpctests_partition`) to device/interface can be used. For partitions not defined in the mapping the default of `all` is used.
 - `hpctests_outdir`: Optional. Directory to use for test output on local host. Defaults to `$HOME/hpctests` (for local user).
 - `hpctests_hpl_NB`: Optional, default 192. The HPL block size "NB" - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/intel-oneapi-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).
 - `hpctests_hpl_mem_frac`: Optional, default 0.8. The HPL problem size "N" will be selected to target using this fraction of each node's memory.
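
To make the two accepted forms of `hpctests_ucx_net_devices` concrete, here is a sketch of how the variable might be set in an environment's group vars. The partition names `hpc` and `general` are hypothetical, and the device names are illustrative only; `mlx5_1:0` is the example value from the README.

```yaml
# Form 1 (existing behaviour): a single device/interface string,
# used regardless of which partition the tests run in.
hpctests_ucx_net_devices: "mlx5_1:0"
```

```yaml
# Form 2 (added by this commit): a mapping keyed by partition name.
# The entry matching hpctests_partition is selected; partitions that
# are not listed fall back to "all".
hpctests_partition: hpc          # hypothetical partition name
hpctests_ucx_net_devices:
  hpc: "mlx5_1:0"
  general: "mlx5_0:1"            # hypothetical partition and device
```

Only one of the two forms would be set at a time. Quoting the device strings is a precaution because they contain a colon.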

ansible/roles/hpctests/tasks/setup.yml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,8 @@
     owner: "{{ ansible_user }}"
     group: "{{ ansible_user }}"
   become: true
+
+- name: Set fact for UCX_NET_DEVICES
+  set_fact:
+    hpctests_ucx_net_devices: "{{ hpctests_ucx_net_devices.get(hpctests_partition, 'all') }}"
+  when: hpctests_ucx_net_devices is mapping
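
The added task can be tried in isolation with a small playbook such as the sketch below. The play, the `vars` values and the `debug` task are hypothetical scaffolding; only the `set_fact` task is taken from the diff. The sketch assumes `hpctests_partition` is defined whenever the mapping form is used, since the lookup key would otherwise be undefined.

```yaml
# Hypothetical standalone playbook exercising the task from this commit.
- hosts: localhost
  gather_facts: false
  vars:
    hpctests_partition: hpc        # hypothetical partition name
    hpctests_ucx_net_devices:      # mapping form
      hpc: "mlx5_1:0"
  tasks:
    - name: Set fact for UCX_NET_DEVICES
      set_fact:
        # Replace the mapping with the entry for the selected partition,
        # or "all" if the partition has no entry.
        hpctests_ucx_net_devices: "{{ hpctests_ucx_net_devices.get(hpctests_partition, 'all') }}"
      # Skipped when the variable is already a plain string.
      when: hpctests_ucx_net_devices is mapping

    - name: Show the resolved value
      debug:
        var: hpctests_ucx_net_devices
```

Because facts set with `set_fact` take precedence over play and role variables, later tasks that read `hpctests_ucx_net_devices` would see the resolved string rather than the original mapping. With the values above the resolved value should be `mlx5_1:0`; changing `hpctests_partition` to a name absent from the mapping should yield `all`.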
