
Commit c7f3b24

sjpb authored, with co-authors bertiethorpe and sd109
Update docs to include operations (#422)
* Update README.md
* OSes supported as deploy hosts
* undo readme OSes supported
* add operations docs
* simplify main README.md to only cover default configuration
* move more-specific documentation into their own files
* provide site docs directory
* address docs review comments
* Fix a / in docs

  Co-authored-by: Scott Davidson <49713135+sd109@users.noreply.github.com>

* address PR comments on docs
* address PR comments on docs

---------

Co-authored-by: bertiethorpe <84867280+bertiethorpe@users.noreply.github.com>
Co-authored-by: Scott Davidson <49713135+sd109@users.noreply.github.com>
1 parent 059d0f8 commit c7f3b24

8 files changed, with 280 additions and 109 deletions


README.md

Lines changed: 70 additions & 109 deletions
@@ -2,36 +2,47 @@
 
 # StackHPC Slurm Appliance
 
-This repository contains playbooks and configuration to define a Slurm-based HPC environment including:
-- A Rocky Linux 9 and OpenHPC v3-based Slurm cluster.
-- Shared filesystem(s) using NFS (with servers within or external to the cluster).
-- Slurm accounting using a MySQL backend.
-- A monitoring backend using Prometheus and ElasticSearch.
-- Grafana with dashboards for both individual nodes and Slurm jobs.
-- Production-ready Slurm defaults for access and memory.
-- A Packer-based build pipeline for compute and login node images.
-
-The repository is designed to be forked for a specific use-case/HPC site but can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us!
-
-While it is tested on OpenStack it should work on any cloud, except for node rebuild/reimaging features which are currently OpenStack-specific.
-
-## Prerequisites
-It is recommended to check the following before starting:
-- You have root access on the "ansible deploy host" which will be used to deploy the appliance.
+This repository contains playbooks and configuration to define a Slurm-based HPC environment. This includes:
+- [Rocky Linux](https://rockylinux.org/)-based hosts.
+- [OpenTofu](https://opentofu.org/) configurations to define the cluster's infrastructure-as-code.
+- Packages for Slurm and MPI software stacks from [OpenHPC](https://openhpc.community/).
+- Shared filesystem(s) using NFS (with in-cluster or external servers) or [CephFS](https://docs.ceph.com/en/latest/cephfs/) via [OpenStack Manila](https://wiki.openstack.org/wiki/Manila).
+- Slurm accounting using a MySQL database.
+- Monitoring integrated with Slurm jobs using Prometheus, ElasticSearch and Grafana.
+- A web-based portal from [Open OnDemand](https://openondemand.org/).
+- Production-ready default Slurm configurations for access and memory limits.
+- [Packer](https://developer.hashicorp.com/packer)-based image build configurations for node images.
+
+The repository is expected to be forked for a specific HPC site but can contain multiple environments for e.g. development, staging and production clusters
+sharing a common configuration. It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs
+back upstream to us!
+
+While it is tested on OpenStack it should work on any cloud with appropriate OpenTofu configuration files.
+
+## Demonstration Deployment
+
+The default configuration in this repository may be used to create a cluster to explore use of the appliance. It provides:
+- Persistent state backed by an OpenStack volume.
+- NFS-based shared file system backed by another OpenStack volume.
+
+Note that the Open OnDemand portal and its remote apps are not usable with this default configuration.
+
+It requires an OpenStack cloud, and an Ansible "deploy host" with access to that cloud.
+
+Before starting, ensure that:
+- You have root access on the deploy host.
 - You can create instances using a Rocky 9 GenericCloud image (or an image based on that).
 - **NB**: In general it is recommended to use the [latest released image](https://github.com/stackhpc/ansible-slurm-appliance/releases) which already contains the required packages. This is built and tested in StackHPC's CI. However the appliance will install the necessary packages if a GenericCloud image is used.
-- SSH keys get correctly injected into instances.
-- Instances have access to the internet (note proxies can be set up through the appliance if necessary).
-- DNS works (if not, this can be partially worked around but additional configuration will be required).
+- You have an SSH keypair defined in OpenStack, with the private key available on the deploy host.
+- Created instances have access to the internet (note proxies can be set up through the appliance if necessary).
 - Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not, or for bare metal instances, it may be necessary to configure a time service via the appliance).
 
-## Installation on deployment host
+### Setup deploy host
 
-Current Operating Systems supported to be deploy hosts:
+The following operating systems are supported for the deploy host:
 
 - Rocky Linux 9
 - Rocky Linux 8
-- Ubuntu 22.04
 
 These instructions assume the deployment host is running Rocky Linux 8:
 
@@ -40,115 +51,65 @@ These instructions assume the deployment host is running Rocky Linux 8:
     cd ansible-slurm-appliance
     ./dev/setup-env.sh
 
-## Overview of directory structure
-
-- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates.
-- `ansible/`: Contains the ansible playbooks to configure the infrastructure.
-- `packer/`: Contains automation to use Packer to build compute nodes for an environment - see the README in this directory for further information.
-- `dev/`: Contains development tools.
-
-## Environments
+You will also need to install [OpenTofu](https://opentofu.org/docs/intro/install/rpm/).
 
-### Overview
+### Create a new environment
 
-An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in `environments/`, containing:
-- Any deployment automation required - e.g. Terraform configuration or HEAT templates.
-- An ansible `inventory/` directory.
-- An `activate` script which sets environment variables to point to this configuration.
-- Optionally, additional playbooks in `/hooks` to run before or after the main tasks.
-
-All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required.
-
-### Creating a new environment
-
-This repo contains a `cookiecutter` template which can be used to create a new environment from scratch. Run the [installation on deployment host](#Installation-on-deployment-host) instructions above, then in the repo root run:
+Use the `cookiecutter` template to create a new environment to hold your configuration. In the repository root run:
 
     . venv/bin/activate
    cd environments
    cookiecutter skeleton
 
 and follow the prompts to complete the environment name and description.
 
-Alternatively, you could copy an existing environment directory.
-
-Now add deployment automation if required, and then complete the environment-specific inventory as described below.
+**NB:** In subsequent sections this new environment is referred to as `$ENV`.
 
-### Environment-specific inventory structure
+Now generate secrets for this environment:
 
-The ansible inventory for the environment is in `environments/<environment>/inventory/`. It should generally contain:
-- A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation so it is also a convenient place to define variables which depend on the deployed hosts such as connection variables, IP addresses, ssh proxy arguments etc.
-- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/common/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality.
-- Optionally, group variable files in `group_vars/<group_name>/overrides.yml`, where the group names match the functional groups described above. These can be used to override the default configuration for each functionality, which are defined in `environments/common/inventory/group_vars/all/<group_name>.yml` (the use of `all` here is due to ansible's precedence rules).
+    ansible-playbook ansible/adhoc/generate-passwords.yml
 
-Although most of the inventory uses the group convention described above there are a few special cases:
-- The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file.
-- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient.
-- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using an `environments/<environment>/inventory/group_vars/all/overrides.yml` file.
-- The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments/<environment>/inventory/group_vars/all/secrets.yml`.
-- The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path.
-- Each Slurm partition must have:
-  - An inventory group `<cluster_name>_<partition_name>` defining the hosts it contains - these must be homogeneous w.r.t. CPU and memory.
-  - An entry in the `openhpc_slurm_partitions` mapping in `environments/<environment>/inventory/group_vars/openhpc/overrides.yml`.
-  See the [openhpc role documentation](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) for more options.
-- On an OpenStack cloud, rebuilding/reimaging compute nodes from Slurm can be enabled by defining a `rebuild` group containing the relevant compute hosts (e.g. in the generated `hosts` file).
+### Define infrastructure configuration
 
-## Creating a Slurm appliance
+Create an OpenTofu variables file to define the required infrastructure, e.g.:
 
-NB: This section describes generic instructions - check for any environment-specific instructions in `environments/<environment>/README.md` before starting.
+    # environments/$ENV/terraform/terraform.tfvars:
 
-1. Activate the environment - this **must be done** before any other commands are run:
+    cluster_name = "mycluster"
+    cluster_net = "some_network" # *
+    cluster_subnet = "some_subnet" # *
+    key_pair = "my_key" # *
+    control_node_flavor = "some_flavor_name"
+    login_nodes = {
+        login-0: "login_flavor_name"
+    }
+    cluster_image_id = "rocky_linux_9_image_uuid"
+    compute = {
+        general = {
+            nodes: ["compute-0", "compute-1"]
+            flavor: "compute_flavor_name"
+        }
+    }
 
-       source environments/<environment>/activate
+Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables
+and descriptions see `environments/$ENV/terraform/terraform.tfvars`.
 
-2. Deploy instances - see environment-specific instructions.
+### Deploy appliance
 
-3. Generate passwords:
+    ansible-playbook ansible/site.yml
 
-       ansible-playbook ansible/adhoc/generate-passwords.yml
+You can now log in to the cluster using:
 
-   This will output a set of passwords in `environments/<environment>/inventory/group_vars/all/secrets.yml`. It is recommended that these are encrypted and then committed to git using:
+    ssh rocky@$login_ip
 
-       ansible-vault encrypt inventory/group_vars/all/secrets.yml
+where the IP of the login node is given in `environments/$ENV/inventory/hosts.yml`.
 
-   See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details.
 
-4. Deploy the appliance:
-
-       ansible-playbook ansible/site.yml
-
-   or if you have encrypted secrets use:
-
-       ansible-playbook ansible/site.yml --ask-vault-password
-
-   Tags as defined in the various sub-playbooks defined in `ansible/` may be used to only run part of the `site` tasks.
-
-5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using:
-
-       ansible-playbook ansible/adhoc/<playbook name>
-
-   Currently they include the following (see each playbook for links to documentation):
-   - `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance.
-   - `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development, not for reimaging nodes on an in-production cluster - see `ansible/roles/rebuild` for that).
-   - `restart-slurm.yml`: Restart all Slurm daemons in the correct order.
-   - `update-packages.yml`: Update specified packages on cluster nodes.
-
-## Adding new functionality
-Please contact us for specific advice, but in outline this generally involves:
-- Adding a role.
-- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`.
-- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
-- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
-- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
-- Updating READMEs.
-
-## Monitoring and logging
-
-Please see the [monitoring-and-logging.README.md](docs/monitoring-and-logging.README.md) for details.
-
-## CI/CD automation
-
-The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintenance tasks. These include:
+## Overview of directory structure
 
-- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published.
+- `environments/`: See [docs/environments.md](docs/environments.md).
+- `ansible/`: Contains the ansible playbooks to configure the infrastructure.
+- `packer/`: Contains automation to use Packer to build machine images for an environment - see the README in this directory for further information.
+- `dev/`: Contains development tools.
 
-- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud.
+For further information see the [docs](docs/) directory.

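As background for the inventory documentation moved out of the README above: the `openhpc_slurm_partitions` mapping it refers to is consumed by the [stackhpc.openhpc role](https://github.com/stackhpc/ansible-role-openhpc). A minimal sketch of such an entry is shown below; the cluster and partition names are illustrative and not taken from this commit.

```yaml
# environments/$ENV/inventory/group_vars/openhpc/overrides.yml (illustrative sketch)
# Assumes openhpc_cluster_name is set to "mycluster" (e.g. via [all:vars] in the hosts file)
# and that the hosts for this partition are in the inventory group "mycluster_general".
openhpc_slurm_partitions:
  - name: general
```

See the openhpc role documentation linked in the diff for the full set of partition options.
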
docs/adding-functionality.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# Adding new functionality
+
+Please contact us for specific advice, but this generally involves:
+- Adding a role.
+- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`.
+- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
+- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
+- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
+- Updating READMEs.

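To make the "adding a play calling that role" step above concrete, a minimal sketch of such a play is shown below. The role and group name `mynewrole` and the playbook path are hypothetical examples, not part of this commit.

```yaml
# ansible/mynewrole.yml (hypothetical playbook; it would also be imported from site.yml)
- name: Configure mynewrole
  hosts: mynewrole        # empty by default; enabled by populating this group in an environment
  become: true
  tags: mynewrole
  tasks:
    - name: Apply the mynewrole role
      ansible.builtin.import_role:
        name: mynewrole
```
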
docs/ci.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# CI/CD automation
+
+The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintenance tasks. These include:
+
+- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published.
+
+- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud.
+

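The sample workflow files themselves are not shown in this commit. As a rough, assumed illustration of the kind of scheduled job the upgrade check describes (not the contents of the actual `.yml.sample` files), such a workflow might be skeletoned as:

```yaml
# Hypothetical skeleton only - see .github/workflows/upgrade-check.yml.sample for the real version
name: upgrade-check
on:
  schedule:
    - cron: "0 6 * * 1"   # e.g. a weekly check
  workflow_dispatch: {}
jobs:
  check-upstream-release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Get latest upstream release tag
        run: |
          curl -s https://api.github.com/repos/stackhpc/ansible-slurm-appliance/releases/latest | jq -r .tag_name
      # ...compare against the currently pinned release and open a PR if a newer one exists
```
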
0 commit comments
