Skip to content

Commit 45ad462

Browse files
committed
refactor: improve flavor system implementation
- Fix flavor default application to only apply when pod doesn't specify limits - Resolve flavor once and pass to produceSLURMScript to avoid redundant resolution - Fix pointer to loop variable issue by creating explicit copies - Improve GPU flavor matching to prefer exact GPU count match - Add SLURM flag deduplication with proper priority handling - Add FlavorConfig validation method with startup checks - Add comprehensive unit tests for all flavor functions (43 new tests) - Update README with detailed flavor system documentation and examples - Remove unused variables (maxCPULimit, maxMemoryLimit) This fixes critical bugs where flavor defaults would override pod-specified resources, and adds proper validation and testing for the flavor system. Signed-off-by: Diego Ciangottini <diego.ciangottini@pg.infn.it>
1 parent c64b90d commit 45ad462

File tree

6 files changed

+696
-62
lines changed

6 files changed

+696
-62
lines changed

README.md

Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -128,6 +128,111 @@ It is possible to specify Annotations when submitting Pods to the K8S cluster. A
128128
| slurm-job.vk.io/image-root | Used to specify the root path of the Singularity Image |
129129
| slurm-job.vk.io/flags | Used to specify SLURM flags. These flags will be added to the SLURM script in the form of #SBATCH flag1, #SBATCH flag2, etc |
130130
| slurm-job.vk.io/mpi-flags | Used to prepend "mpiexec -np $SLURM_NTASKS \*flags\*" to the Singularity Execution |
131+
| slurm-job.vk.io/flavor | Used to explicitly select a flavor configuration (e.g., "gpu-nvidia", "high-io") |
132+
133+
### :art: Flavor System
134+
135+
The SLURM plugin supports "flavors" - predefined configurations that provide default resource values and SLURM-specific settings. This simplifies pod definitions and ensures consistent resource allocation across jobs.
136+
137+
#### How Flavors Work
138+
139+
Flavors are resolved in the following priority order:
140+
1. **Explicit annotation**: `slurm-job.vk.io/flavor: "flavor-name"`
141+
2. **Auto-detection**: GPU resources automatically select GPU flavors (exact GPU count match preferred)
142+
3. **Default flavor**: Falls back to the flavor specified in `DefaultFlavor` config
143+
144+
#### Configuring Flavors
145+
146+
Add flavors to your `SlurmConfig.yaml`:
147+
148+
```yaml
149+
DefaultFlavor: "default"
150+
Flavors:
151+
default:
152+
Name: "default"
153+
Description: "Standard CPU job (4 cores, 16GB RAM)"
154+
CPUDefault: 4
155+
MemoryDefault: "16G"
156+
SlurmFlags:
157+
- "--partition=cpu"
158+
- "--time=01:00:00"
159+
160+
gpu-nvidia:
161+
Name: "gpu-nvidia"
162+
Description: "GPU job with NVIDIA GPU (8 cores, 64GB RAM, 1 GPU)"
163+
CPUDefault: 8
164+
MemoryDefault: "64G"
165+
SlurmFlags:
166+
- "--gres=gpu:1"
167+
- "--partition=gpu"
168+
- "--time=04:00:00"
169+
170+
high-io:
171+
Name: "high-io"
172+
Description: "High I/O job (16 cores, 32GB RAM, fast storage)"
173+
CPUDefault: 16
174+
MemoryDefault: "32G"
175+
SlurmFlags:
176+
- "--partition=fast-io"
177+
- "--constraint=ssd"
178+
```
179+
180+
#### Flavor Behavior
181+
182+
- **Default Resources**: Flavor CPU/memory defaults apply ONLY when pod doesn't specify resource limits
183+
- **Pod Overrides**: If pod specifies resource limits, those take precedence over flavor defaults
184+
- **SLURM Flag Priority**: Flavor flags < Annotation flags < Pod resource limits
185+
- **Flag Deduplication**: Duplicate flags are automatically removed, with later flags overriding earlier ones
186+
187+
#### Example: Using Flavors
188+
189+
**Example 1: Auto-detected GPU flavor**
190+
```yaml
191+
apiVersion: v1
192+
kind: Pod
193+
metadata:
194+
name: gpu-job
195+
spec:
196+
containers:
197+
- name: pytorch
198+
image: docker://pytorch/pytorch:latest
199+
resources:
200+
limits:
201+
nvidia.com/gpu: 1 # Automatically selects "gpu-nvidia" flavor
202+
```
203+
204+
**Example 2: Explicit flavor selection**
205+
```yaml
206+
apiVersion: v1
207+
kind: Pod
208+
metadata:
209+
name: io-intensive-job
210+
annotations:
211+
slurm-job.vk.io/flavor: "high-io"
212+
spec:
213+
containers:
214+
- name: data-processor
215+
image: docker://myapp:latest
216+
# Will use high-io flavor's 16 CPU and 32GB RAM defaults
217+
```
218+
219+
**Example 3: Pod resources override flavor defaults**
220+
```yaml
221+
apiVersion: v1
222+
kind: Pod
223+
metadata:
224+
name: custom-resources
225+
annotations:
226+
slurm-job.vk.io/flavor: "default"
227+
spec:
228+
containers:
229+
- name: app
230+
image: docker://myapp:latest
231+
resources:
232+
limits:
233+
cpu: "32" # Overrides flavor's 4 CPU default
234+
memory: "128Gi" # Overrides flavor's 16GB default
235+
```
131236

132237
### :gear: Explanation of the SLURM Config file
133238

@@ -156,6 +261,8 @@ building the docker image (`docker compose up -d --build --force-recreate` will
156261
| VerboseLogging | Enable or disable Debug messages on logs. True or False values only |
157262
| ErrorsOnlyLogging | Specify if you want to get errors only on logs. True or false values only |
158263
| EnableProbes | Enable or disable health and readiness probes. True or False values only |
264+
| Flavors | Map of flavor configurations. Each flavor can specify CPUDefault, MemoryDefault, and SlurmFlags. See Flavor System section above for details |
265+
| DefaultFlavor | Name of the default flavor to use when no explicit flavor is specified and no auto-detection applies |
159266

160267
### :wrench: Environment Variables list
161268

pkg/slurm/Create.go

Lines changed: 26 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -72,26 +72,9 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {
7272
isDefaultCPU := true
7373
isDefaultRam := true
7474

75-
maxCPULimit := 0
76-
maxMemoryLimit := 0
77-
7875
cpuLimit := int64(0)
7976
memoryLimit := int64(0)
8077

81-
// Apply flavor defaults if available
82-
if flavor != nil {
83-
if flavor.CPUDefault > 0 {
84-
cpuLimit = flavor.CPUDefault
85-
maxCPULimit = int(flavor.CPUDefault)
86-
log.G(h.Ctx).Infof("Applying CPU default from flavor '%s': %d", flavor.FlavorName, flavor.CPUDefault)
87-
}
88-
if flavor.MemoryDefault > 0 {
89-
memoryLimit = flavor.MemoryDefault
90-
maxMemoryLimit = int(flavor.MemoryDefault)
91-
log.G(h.Ctx).Infof("Applying memory default from flavor '%s': %d bytes", flavor.FlavorName, flavor.MemoryDefault)
92-
}
93-
}
94-
9578
for i, container := range containers {
9679
log.G(h.Ctx).Info("- Beginning script generation for container " + container.Name)
9780

@@ -102,28 +85,40 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {
10285

10386
cpuLimitFromContainer := int64(math.Ceil(cpuLimitFloat))
10487

105-
if cpuLimitFromContainer == 0 && isDefaultCPU {
106-
log.G(h.Ctx).Warning(errors.New("Max CPU resource not set for " + container.Name + ". Only 1 CPU will be used"))
107-
resourceLimits.CPU = 1
88+
if cpuLimitFromContainer == 0 {
89+
// No CPU limit specified in container, check if we should use flavor default
90+
if isDefaultCPU && flavor != nil && flavor.CPUDefault > 0 {
91+
log.G(h.Ctx).Infof("Max CPU resource not set for %s. Using flavor '%s' default: %d CPU", container.Name, flavor.FlavorName, flavor.CPUDefault)
92+
cpuLimit = flavor.CPUDefault
93+
} else if isDefaultCPU {
94+
log.G(h.Ctx).Warning(errors.New("Max CPU resource not set for " + container.Name + ". Only 1 CPU will be used"))
95+
cpuLimit = 1
96+
}
10897
} else {
109-
if cpuLimitFromContainer > resourceLimits.CPU && maxCPULimit < int(cpuLimitFromContainer) {
98+
// Container specified CPU limit
99+
if cpuLimitFromContainer > cpuLimit {
110100
log.G(h.Ctx).Info("Setting CPU limit to " + strconv.FormatInt(cpuLimitFromContainer, 10))
111101
cpuLimit = cpuLimitFromContainer
112-
maxCPULimit = int(cpuLimitFromContainer)
113-
isDefaultCPU = false
114102
}
103+
isDefaultCPU = false
115104
}
116105

117-
if memoryLimitFromContainer == 0 && isDefaultRam {
118-
log.G(h.Ctx).Warning(errors.New("Max Memory resource not set for " + container.Name + ". Only 1MB will be used"))
119-
resourceLimits.Memory = 1024 * 1024
106+
if memoryLimitFromContainer == 0 {
107+
// No memory limit specified in container, check if we should use flavor default
108+
if isDefaultRam && flavor != nil && flavor.MemoryDefault > 0 {
109+
log.G(h.Ctx).Infof("Max Memory resource not set for %s. Using flavor '%s' default: %d bytes", container.Name, flavor.FlavorName, flavor.MemoryDefault)
110+
memoryLimit = flavor.MemoryDefault
111+
} else if isDefaultRam {
112+
log.G(h.Ctx).Warning(errors.New("Max Memory resource not set for " + container.Name + ". Only 1MB will be used"))
113+
memoryLimit = 1024 * 1024
114+
}
120115
} else {
121-
if memoryLimitFromContainer > resourceLimits.Memory && maxMemoryLimit < int(memoryLimitFromContainer) {
116+
// Container specified memory limit
117+
if memoryLimitFromContainer > memoryLimit {
122118
log.G(h.Ctx).Info("Setting Memory limit to " + strconv.FormatInt(memoryLimitFromContainer, 10))
123119
memoryLimit = memoryLimitFromContainer
124-
maxMemoryLimit = int(memoryLimitFromContainer)
125-
isDefaultRam = false
126120
}
121+
isDefaultRam = false
127122
}
128123

129124
resourceLimits.CPU = cpuLimit
@@ -206,7 +201,7 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {
206201

207202
if data.JobScript == "" {
208203
log.G(h.Ctx).Info("-- No custom job script provided, generating one...")
209-
path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam)
204+
path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam, flavor)
210205
if err != nil {
211206
log.G(h.Ctx).Error(err)
212207
os.RemoveAll(filesPath)
@@ -252,7 +247,7 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {
252247
containerImage: "n/a",
253248
})
254249

255-
path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam)
250+
path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam, flavor)
256251
if err != nil {
257252
log.G(h.Ctx).Error(err)
258253
os.RemoveAll(filesPath)

0 commit comments

Comments
 (0)