
Commit 53eb9d5

ON-15464: Troubleshooting guide
* sfnettest example now uses latency profile by default rather than recommended
* Removed trailing slash on `$REGISTRY_BASE` causing invalid duplicate slashes
* Added built-in syntax validation to user-modifiable commands
* Documented containers of Onload Device Plugin pod
* Updated README to reflect recent `setPreload` changes
* Symlinks to existing docs in `docs/`
1 parent a6d199b commit 53eb9d5

10 files changed, +447 -29 lines changed


DEVELOPING.md

Lines changed: 8 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -13,38 +13,39 @@ and [Kubebuilder](https://kubebuilder.io/).
1313

1414
The Onload Operator and Onload Device Plugin consume Onload container images (`onload-user` and either `onload-source` or `onload-module`). You may wish to pre-populate your cluster's container image registry, either with the [official images provided](README.md#provided-images) or [your own builds](README.md#build).
1515

16-
## Build and deploy Onload Operator from source
16+
## Build and deploy from source
1717

1818
Configure a development registry and configure cluster for [insecure registries](README.md#insecure-registries)
19-
if required. Specify the base of the following images:
19+
if required. Specify the following image locations:
2020

2121
```sh
22-
export REGISTRY_BASE=image-registry.openshift-image-registry.svc:5000/onload-clusterlocal/
22+
export REGISTRY_BASE=image-registry.openshift-image-registry.svc:5000/onload-clusterlocal
23+
export IMG=$REGISTRY_BASE/onload-operator:latest
24+
export DEVICE_IMG=$REGISTRY_BASE/onload-device-plugin:latest
2325
```
2426

2527
Create and push the Onload Operator controller image:
2628

2729
```sh
28-
make docker-build docker-push IMG=$REGISTRY_BASE/onload-operator:latest
30+
make docker-build docker-push
2931
```
3032

3133
Create and push the Onload Device Plugin image:
3234

3335
```sh
34-
export DEVICE_IMG=$REGISTRY_BASE/onload-device-plugin:latest
3536
make device-plugin-docker-build device-plugin-docker-push
3637
```
3738

3839
Deploy the Onload Operator:
3940

4041
```sh
41-
make deploy IMG=$REGISTRY_BASE/operator:latest
42+
make deploy
4243
```
44+
4345
Ensure that `$DEVICE_IMG` is exported when deploying the operator, or append `DEVICE_IMG=...` to the make invocation.
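
The requirements above (no trailing slash on `$REGISTRY_BASE`, and both image variables exported before `make deploy`) can be checked with a small pre-flight function. This is a minimal sketch only: `check_image_env` is a hypothetical helper name and is not part of the repository's Makefile.

```sh
# Hypothetical pre-flight check (not part of the repository).
# Fails if REGISTRY_BASE ends with '/' or if IMG / DEVICE_IMG are not exported.
check_image_env() {
    case ${REGISTRY_BASE:-} in
        */) echo "REGISTRY_BASE must not end with '/'" >&2; return 1 ;;
    esac
    for name in IMG DEVICE_IMG; do
        # `env` lists only exported variables, so this also catches
        # variables that were set but never exported.
        if ! env | grep -q "^${name}=."; then
            echo "${name} is not exported" >&2
            return 1
        fi
    done
}
```

It could then be chained in front of the deployment, for example `check_image_env && make deploy`.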
4446

4547
Continue with [deploying the Onload CR](README.md#onload-custom-resource-cr).
4648

47-
4849
## Footnotes
4950

5051
Copyright (c) 2023 Advanced Micro Devices, Inc.

Makefile

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ sfc-mc-build: ## Build sfc mc yaml
165165

166166
.PHONY: sfc-mc-deploy
167167
sfc-mc-deploy: ## Deploy sfc mc yaml
168-
oc apply -f ./scripts/machineconfig/output/99-sfc-machineconfig.yaml
168+
oc apply --validate=true -f ./scripts/machineconfig/output/99-sfc-machineconfig.yaml
169169

170170
.PHONY: sfc-mc-undeploy
171171
sfc-mc-undeploy:

README.md

Lines changed: 41 additions & 6 deletions
@@ -63,6 +63,8 @@ When using [in-cluster builds](#onload-module-in-cluster-builds), other dependen
6363
the method selected. These may include `ubi-minimal` container image and
6464
[UBI RPM repositories](https://access.redhat.com/articles/4238681).
6565

66+
Nodes require 60MB of root-writable local storage, by default in `/opt`.
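
As a rough illustration of the storage requirement, the sketch below tests whether a directory's filesystem has a given amount of free space. `has_free_mb` is a hypothetical helper, and the check covers free space only, not whether the location is writable by root.

```sh
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# MIN_MB megabytes available.
has_free_mb() {
    dir=$1
    min_mb=$2
    # `df -Pk` prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((min_mb * 1024)) ]
}
```

Run on a node, `has_free_mb /opt 60` would correspond to the default location and size stated above.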
67+
6668
### Provided Images
6769

6870
This repository's YAML configuration uses the following images by default:
@@ -130,19 +132,33 @@ git clone -b v3.0 https://github.com/Xilinx-CNS/kubernetes-onload && cd kubernet
130132

131133
cp -r config/samples/default-clusterlocal config/samples/my-operator
132134
$EDITOR config/samples/my-operator/kustomization.yaml
133-
kubectl apply -k config/samples/my-operator
135+
kubectl apply --validate=true -k config/samples/my-operator
134136
```
135137

138+
> [!TIP]
139+
> Replacing `kubectl apply` with `kubectl kustomize` will output a complete YAML manifest file which can be copied to a
140+
> network that does not have access to this repository.
141+
136142
### Onload Device Plugin
137143

138144
The Onload Device Plugin implements the [Kubernetes Device Plugin API](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)
139145
to expose a [Kubernetes Resource](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)
140146
named `amd.com/onload`.
141147

142-
It is distributed as the container image `onload-device-plugin` and is deployed and configured entirely by
143-
the Onload Operator. Its image location is configured as an environment variable within the Onload Operator deployment
144-
([see above](#local-onload-operator-images-in-restricted-networks)) and its ImagePullPolicy as part of
145-
[Onload Custom Resource (CR)](#onload-custom-resource-cr) along with its other customisation properties.
148+
It is distributed as the container image `onload-device-plugin`. The image location is configured as an environment
149+
variable within the Onload Operator deployment ([see above](#local-onload-operator-images-in-restricted-networks)) and
150+
its ImagePullPolicy as part of [Onload Custom Resource (CR)](#onload-custom-resource-cr), along with its other
151+
customisation properties.
152+
153+
The Onload Operator manages an Onload Device Plugin DaemonSet which deploys, to each node selected for acceleration,
154+
a pod consisting of 3 containers:
155+
156+
* Init (`init` container, `onload-user` image)
157+
-- for copying Onload files to host filesystem and Onload Worker volume.
158+
* Onload Worker (`onload-worker` container, `onload-device-plugin` image)
159+
-- provides Onload Control Plane environment; privileged access to network namespaces.
160+
* Onload Device Plugin (`device-plugin` container, `onload-device-plugin` image)
161+
-- for Kubernetes Device Plugin API; privileged access to Kubernetes API.
146162

147163
### Onload Custom Resource (CR)
148164

@@ -270,7 +286,8 @@ spec:
270286
amd.com/onload: 1
271287
```
272288
273-
All applications started within the pod environment will be accelerated due to the `LD_PRELOAD` environment variable.
289+
All applications started within the pod environment will be accelerated due to the `LD_PRELOAD` environment variable
290+
unless `setPreload: false` is configured in Onload CR.
274291

275292
### Resource `amd.com/onload`
276293

@@ -298,6 +315,17 @@ Binary mounts (if `mountOnload` is true, by default in `/opt/onload/usr/bin/`)
298315
If you wish to customise where files are mounted in the container's filesystem this can be configured with the fields
299316
of `spec.devicePlugin` in an Onload CR.
300317

318+
> [!IMPORTANT]
319+
> Kubernetes Device Plugin only affects initial pod scheduling
320+
>
321+
> Kubernetes Device Plugin is designed to configure pods once only, at creation time. If the Onload CR is re-applied to
322+
> the cluster with settings that would change pod environment -- for example, changing the value of `setPreload` --
323+
> then running pods must be recreated before using these changes.
324+
>
325+
> Additionally, Kubernetes does not evict pods when node resources are removed; pods do not automatically have a formal
326+
> dependency on Onload Device Plugin or Onload Module. This has the advantage that minor Onload Operator behaviour
327+
> does not affect the workloads its components pre-configured.
328+
301329
### Example client-server with sfnettest
302330

303331
Please see [config/samples/sfnettest](config/samples/sfnettest).
@@ -348,6 +376,10 @@ Currently the script produces ConfigMaps with a fixed naming structure,
348376
for example if you want to create a ConfigMap from a profile called
349377
`name.opf` the generated name will be `onload-name-profile`.
350378

379+
## Troubleshooting
380+
381+
Please see dedicated [troubleshooting guide](docs/troubleshooting.md).
382+
351383
## Build
352384

353385
### Onload Module pre-built images
@@ -373,6 +405,9 @@ Please see [DEVELOPING](DEVELOPING.md) documentation.
373405
Developing Onload Operator does not require building these images as official images are available.
374406

375407
If you wish to build these images, please follow ['Distributing as container image' in Onload repository's DEVELOPING](https://github.com/Xilinx-CNS/onload/blob/master/DEVELOPING.md#distributing-as-container-image).
408+
This includes building debug versions. All Onload images in use must be consistent, in exact commit and build
409+
parameters. For example, a debug build of `onload-user` must be used with a debug build of `onload-module`. Build
410+
parameter specification is provided in the sample Onload CRs for the in-cluster build method.
376411

377412
### Insecure registries
378413

config/samples/sfnettest/README.md

Lines changed: 31 additions & 11 deletions
@@ -1,35 +1,55 @@
11
# Onload application example: sfnettest
22

3-
Here, we run a small utility, [`sfnt-pingpong` from sfnettest](https://github.com/Xilinx-CNS/cns-sfnettest), in a client and server pair to demonstrate Onload acceleration.
3+
Here, we run a small utility, [`sfnt-pingpong` from sfnettest](https://github.com/Xilinx-CNS/cns-sfnettest), in a
4+
client and server pair to demonstrate Onload acceleration.
5+
6+
The sfnettest image is solely focused on its performance utils and thus has a micro shell environment; in-depth network
7+
inspection should be performed using dedicated software or Onload tools available on the node's host filesystem.
48

59
## Deploy
610

7-
The example will require customisation to your environment. By default, this will deploy two pods running on nodes named `compute-0` and `compute-1`:
11+
The following example [client-server.yaml](client-server.yaml) will require customisation to your environment. The
12+
manifest utilises two separately deployed resources which are recommended as part of a full Onload Operator deployment:
13+
14+
* A [Multus network](../../../docs/nad.md)
15+
-- connects the pods to hardware that supports acceleration.
16+
* An [Onload profile](../../../README.md#using-onload-profiles)
17+
-- sets environment variables for the pod which are then consumed by userland Onload running in the container(s).
18+
19+
Review the defaults and apply:
20+
21+
* Node names: `compute-0` and `compute-1`
22+
* Network: `ipvlan-bond0` (Multus)
23+
* Namespace: `default`
24+
* Onload accelerated: `amd.com/onload` resource
25+
* Onload profile: `onload-latency-profile`
826

927
```sh
10-
kubectl apply -f client-server.yaml
28+
kubectl apply --validate=true -f client-server.yaml
29+
kubectl describe pods -l app.kubernetes.io/part-of=sfnettest
1130
```
1231

1332
## Interactive test
1433

15-
Obtain the SFC interface's IP address of the `onload-sfnettest-server` pod:
34+
Obtain the SFC interface's IP address of the `onload-sfnettest-server` pod, here `198.19.0.1`:
1635

1736
```sh
18-
$ kubectl describe pod onload-sfnettest-server | grep AddedInterface
19-
Normal AddedInterface 24s multus Add eth0 [192.168.8.114/23] from openshift-sdn
20-
Normal AddedInterface 24s multus Add net1 [198.19.0.1/16] from default/ipvlan-bond0
37+
$ kubectl get events --field-selector involvedObject.name=sfnettest --field-selector reason=AddedInterface
38+
LAST SEEN TYPE REASON OBJECT MESSAGE
39+
24s Normal AddedInterface pod/onload-sfnettest-client Add eth0 [192.168.6.203/23] from openshift-sdn
40+
24s Normal AddedInterface pod/onload-sfnettest-client Add net1 [198.19.1.1/16] from default/ipvlan-bond0
41+
24s Normal AddedInterface pod/onload-sfnettest-server Add eth0 [192.168.8.143/23] from openshift-sdn
42+
24s Normal AddedInterface pod/onload-sfnettest-server Add net1 [198.19.0.1/16] from default/ipvlan-bond0
2143
```
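
To avoid reading the address off by eye, the event listing can be filtered with `awk`. The sketch below assumes the column layout shown in the sample output above (pod in the fourth field, interface in the sixth, bracketed address in the seventh); `added_interface_ip` is a hypothetical helper name.

```sh
# Hypothetical helper: read AddedInterface event lines on stdin and print the
# IPv4 address (without prefix length) given to POD on interface IFACE.
# Usage: added_interface_ip POD IFACE < events.txt
added_interface_ip() {
    awk -v pod="pod/$1" -v iface="$2" '
        $4 == pod && $5 == "Add" && $6 == iface {
            addr = $7
            gsub(/^\[|\]$/, "", addr)   # strip the surrounding square brackets
            sub(/\/.*/, "", addr)       # drop the /prefix-length suffix
            print addr
        }'
}
```

Fed the sample listing above, `added_interface_ip onload-sfnettest-server net1` prints `198.19.0.1`.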
2244

2345
The server pod is already running the accelerated `sfnt-pingpong` instance.
2446

25-
Run the client from the `onload-sfnettest-client` pod:
47+
Run `sfnt-pingpong` as a client within the `onload-sfnettest-client` pod, which has an accelerated environment:
2648

2749
```sh
2850
kubectl exec onload-sfnettest-client -- sfnt-pingpong udp 198.19.0.1
2951
```
3052

31-
You will likely want to [use an Onload profile](../../../README.md#using-onload-profiles).
32-
3353
---
3454

35-
Copyright (c) 2023 Advanced Micro Devices, Inc.
55+
Copyright (c) 2023-2024 Advanced Micro Devices, Inc.

config/samples/sfnettest/client-server.yaml

Lines changed: 13 additions & 1 deletion
@@ -1,10 +1,13 @@
11
# SPDX-License-Identifier: MIT
2-
# SPDX-FileCopyrightText: (c) Copyright 2023 Advanced Micro Devices, Inc.
2+
# SPDX-FileCopyrightText: (c) Copyright 2023-2024 Advanced Micro Devices, Inc.
33
apiVersion: v1
44
kind: Pod
55
metadata:
66
name: onload-sfnettest-server
77
namespace: default
8+
labels:
9+
app.kubernetes.io/part-of: sfnettest
10+
app.kubernetes.io/component: server
811
annotations:
912
k8s.v1.cni.cncf.io/networks: ipvlan-bond0
1013
spec:
@@ -21,6 +24,9 @@ spec:
2124
resources:
2225
limits:
2326
amd.com/onload: 1
27+
envFrom:
28+
- configMapRef:
29+
name: onload-latency-profile
2430
securityContext:
2531
allowPrivilegeEscalation: false
2632
capabilities:
@@ -33,6 +39,9 @@ kind: Pod
3339
metadata:
3440
name: onload-sfnettest-client
3541
namespace: default
42+
labels:
43+
app.kubernetes.io/part-of: sfnettest
44+
app.kubernetes.io/component: client
3645
annotations:
3746
k8s.v1.cni.cncf.io/networks: ipvlan-bond0
3847
spec:
@@ -54,6 +63,9 @@ spec:
5463
resources:
5564
limits:
5665
amd.com/onload: 1
66+
envFrom:
67+
- configMapRef:
68+
name: onload-latency-profile
5769
securityContext:
5870
allowPrivilegeEscalation: false
5971
capabilities:

docs/MachineConfig-sfc.md

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
scripts/machineconfig/README.md
1+
../scripts/machineconfig/README.md

docs/sfnettest.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
../config/samples/sfnettest/README.md

docs/sfptpd.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
../config/samples/sfptpd/README.md
