Merged
107 changes: 107 additions & 0 deletions README.md
@@ -128,6 +128,111 @@ It is possible to specify Annotations when submitting Pods to the K8S cluster. A
| slurm-job.vk.io/image-root | Used to specify the root path of the Singularity Image |
| slurm-job.vk.io/flags | Used to specify SLURM flags. These flags will be added to the SLURM script in the form of #SBATCH flag1, #SBATCH flag2, etc |
| slurm-job.vk.io/mpi-flags | Used to prepend "mpiexec -np $SLURM_NTASKS \*flags\*" to the Singularity Execution |
| slurm-job.vk.io/flavor | Used to explicitly select a flavor configuration (e.g., "gpu-nvidia", "high-io") |

### :art: Flavor System

The SLURM plugin supports "flavors": predefined configurations that supply default resource values and SLURM-specific settings. Flavors simplify pod definitions and keep resource allocation consistent across jobs.

#### How Flavors Work

Flavors are resolved in the following priority order:
1. **Explicit annotation**: `slurm-job.vk.io/flavor: "flavor-name"`
2. **Auto-detection**: a pod that requests GPU resources is matched to a GPU flavor, preferring one whose GPU count matches the request exactly
3. **Default flavor**: otherwise, falls back to the flavor named by the `DefaultFlavor` config key
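
The priority order above can be sketched in Go as follows. This is an illustrative sketch, not the plugin's `resolveFlavor`: the `Flavor` type, its `GPUCount` field, the `pickFlavor` helper, and the choice to return an error for an unknown annotation are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Flavor is a hypothetical, simplified view of one configured flavor.
type Flavor struct {
	Name     string
	GPUCount int64 // assumed: GPUs the flavor provides; 0 for CPU-only flavors
}

// pickFlavor is an illustrative helper (not the plugin's resolveFlavor).
// It applies the three documented steps in order and returns "" when
// nothing matches.
func pickFlavor(annotation string, requestedGPUs int64, flavors map[string]Flavor, defaultFlavor string) (string, error) {
	// 1. An explicit annotation wins. Rejecting unknown names is an assumption.
	if annotation != "" {
		if _, ok := flavors[annotation]; !ok {
			return "", fmt.Errorf("flavor %q is not configured", annotation)
		}
		return annotation, nil
	}

	// 2. Auto-detection: prefer an exact GPU count match, else any GPU flavor.
	if requestedGPUs > 0 {
		// Sort the names so the result does not depend on map iteration order.
		names := make([]string, 0, len(flavors))
		for name := range flavors {
			names = append(names, name)
		}
		sort.Strings(names)

		fallback := ""
		for _, name := range names {
			f := flavors[name]
			if f.GPUCount == requestedGPUs {
				return name, nil
			}
			if f.GPUCount > 0 && fallback == "" {
				fallback = name
			}
		}
		if fallback != "" {
			return fallback, nil
		}
	}

	// 3. Fall back to the configured default, if it exists.
	if _, ok := flavors[defaultFlavor]; ok {
		return defaultFlavor, nil
	}
	return "", nil
}
```

With flavors `default`, `gpu-nvidia` (1 GPU) and a hypothetical `gpu-x4` (4 GPUs), a pod requesting 4 GPUs without an annotation would resolve to `gpu-x4`, while a pod with no GPU request would resolve to `default`.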

#### Configuring Flavors

Add flavors to your `SlurmConfig.yaml`:

```yaml
DefaultFlavor: "default"
Flavors:
  default:
    Name: "default"
    Description: "Standard CPU job (4 cores, 16GB RAM)"
    CPUDefault: 4
    MemoryDefault: "16G"
    SlurmFlags:
      - "--partition=cpu"
      - "--time=01:00:00"

  gpu-nvidia:
    Name: "gpu-nvidia"
    Description: "GPU job with NVIDIA GPU (8 cores, 64GB RAM, 1 GPU)"
    CPUDefault: 8
    MemoryDefault: "64G"
    SlurmFlags:
      - "--gres=gpu:1"
      - "--partition=gpu"
      - "--time=04:00:00"

  high-io:
    Name: "high-io"
    Description: "High I/O job (16 cores, 32GB RAM, fast storage)"
    CPUDefault: 16
    MemoryDefault: "32G"
    SlurmFlags:
      - "--partition=fast-io"
      - "--constraint=ssd"
```
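
The configuration writes `MemoryDefault` as a string such as `"16G"`, while the handler in this change compares and logs the resolved value as a number of bytes. A minimal sketch of one possible conversion is shown below. The `parseMemory` helper is hypothetical, and the 1024-based multipliers and the accepted suffixes (`K`, `M`, `G`, `T`) are assumptions; the plugin's actual parsing may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemory converts a value such as "16G" into bytes.
// Hypothetical helper: suffixes and 1024-based multipliers are assumptions.
func parseMemory(value string) (int64, error) {
	s := strings.ToUpper(strings.TrimSpace(value))
	if s == "" {
		return 0, fmt.Errorf("empty memory value")
	}

	multipliers := map[byte]int64{
		'K': 1 << 10,
		'M': 1 << 20,
		'G': 1 << 30,
		'T': 1 << 40,
	}

	// A plain number is taken as bytes; a trailing suffix scales it.
	multiplier := int64(1)
	digits := s
	if m, ok := multipliers[s[len(s)-1]]; ok {
		multiplier = m
		digits = s[:len(s)-1]
	}

	n, err := strconv.ParseInt(digits, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid memory value %q: %w", value, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("negative memory value %q", value)
	}
	return n * multiplier, nil
}
```

Under these assumptions `parseMemory("16G")` yields 17179869184 bytes. Kubernetes-style suffixes such as `Gi` are deliberately rejected by this sketch rather than guessed at.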

#### Flavor Behavior

- **Default Resources**: Flavor CPU and memory defaults apply only when the pod does not specify resource limits
- **Pod Overrides**: If the pod specifies resource limits, they take precedence over the flavor defaults
- **SLURM Flag Priority** (lowest to highest): flavor flags < annotation flags < pod resource limits
- **Flag Deduplication**: Duplicate flags are removed automatically; when the same flag appears more than once, the later occurrence wins
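
The flag priority and deduplication rules can be illustrated with a small Go sketch. The `mergeFlags` and `flagKey` helpers are hypothetical and only handle long options of the form `--name=value` or `--name`; how the plugin itself identifies duplicate flags is not shown in this change.

```go
package main

import "strings"

// flagKey returns the option name of a flag, e.g. "--partition" for
// "--partition=cpu". Hypothetical helper for this sketch.
func flagKey(flag string) string {
	key, _, _ := strings.Cut(flag, "=")
	return strings.TrimSpace(key)
}

// mergeFlags merges flag lists given in ascending priority order.
// When an option appears more than once, the later value replaces the
// earlier one in place, so each option is emitted only once.
func mergeFlags(layers ...[]string) []string {
	merged := []string{}
	position := map[string]int{}
	for _, layer := range layers {
		for _, flag := range layer {
			key := flagKey(flag)
			if i, seen := position[key]; seen {
				merged[i] = flag // later layer overrides earlier one
				continue
			}
			position[key] = len(merged)
			merged = append(merged, flag)
		}
	}
	return merged
}
```

Calling `mergeFlags` with flavor flags `--partition=cpu`, `--time=01:00:00` followed by annotation flags `--partition=gpu` would give `--partition=gpu`, `--time=01:00:00`: the annotation value replaces the flavor value for the same option.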

#### Example: Using Flavors

**Example 1: Auto-detected GPU flavor**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  containers:
  - name: pytorch
    image: docker://pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # Automatically selects "gpu-nvidia" flavor
```

**Example 2: Explicit flavor selection**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: io-intensive-job
  annotations:
    slurm-job.vk.io/flavor: "high-io"
spec:
  containers:
  - name: data-processor
    image: docker://myapp:latest
    # Will use high-io flavor's 16 CPU and 32GB RAM defaults
```

**Example 3: Pod resources override flavor defaults**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-resources
  annotations:
    slurm-job.vk.io/flavor: "default"
spec:
  containers:
  - name: app
    image: docker://myapp:latest
    resources:
      limits:
        cpu: "32"        # Overrides flavor's 4 CPU default
        memory: "128Gi"  # Overrides flavor's 16GB default
```
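
The override rule shown in Example 3 can be sketched as a small Go function. This follows the README wording (explicit pod limits first, then the flavor default, then a 1 CPU fallback) and is not a copy of the handler in `pkg/slurm/Create.go`, which may behave differently for pods that mix containers with and without limits. The `effectiveCPU` name and its signature are hypothetical.

```go
package main

// effectiveCPU picks the CPU count for a job.
// containerLimits holds each container's CPU limit in whole cores
// (0 means "not set"); flavorDefault is the flavor's CPUDefault
// (0 means "no flavor default"). Hypothetical helper for illustration.
func effectiveCPU(containerLimits []int64, flavorDefault int64) int64 {
	// The largest explicit container limit, if any, takes precedence.
	var explicit int64
	for _, limit := range containerLimits {
		if limit > explicit {
			explicit = limit
		}
	}

	switch {
	case explicit > 0:
		return explicit
	case flavorDefault > 0:
		return flavorDefault
	default:
		// No limit and no flavor default: fall back to a single CPU.
		return 1
	}
}
```

For Example 3, `effectiveCPU([]int64{32}, 4)` gives 32, while for Example 2 (no limits, `high-io` flavor) `effectiveCPU([]int64{0}, 16)` gives 16.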

### :gear: Explanation of the SLURM Config file

@@ -156,6 +261,8 @@ building the docker image (`docker compose up -d --build --force-recreate` will
| VerboseLogging | Enable or disable Debug messages on logs. True or False values only |
| ErrorsOnlyLogging | Specify if you want to get errors only on logs. True or false values only |
| EnableProbes | Enable or disable health and readiness probes. True or False values only |
| Flavors | Map of flavor configurations. Each flavor can specify CPUDefault, MemoryDefault, and SlurmFlags. See the Flavor System section above for details |
| DefaultFlavor | Name of the default flavor to use when no explicit flavor is specified and no auto-detection applies |
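
Because `DefaultFlavor` refers to an entry of `Flavors` by name, a configuration loader might check the two keys against each other. The Go sketch below shows one way to do that. The `FlavorConfig` and `Config` types and the `validateFlavors` helper are assumptions built from the field names in the YAML examples; they are not taken from the plugin's source, and the specific checks are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// FlavorConfig mirrors the YAML fields of one flavor (assumed types).
type FlavorConfig struct {
	Name          string
	Description   string
	CPUDefault    int64
	MemoryDefault string
	SlurmFlags    []string
}

// Config holds only the two flavor-related keys (hypothetical type).
type Config struct {
	DefaultFlavor string
	Flavors       map[string]FlavorConfig
}

// validateFlavors reports the first inconsistency it finds, or nil.
func validateFlavors(cfg Config) error {
	// The default flavor, when set, must name a configured flavor.
	if cfg.DefaultFlavor != "" {
		if _, ok := cfg.Flavors[cfg.DefaultFlavor]; !ok {
			return fmt.Errorf("DefaultFlavor %q is not defined in Flavors", cfg.DefaultFlavor)
		}
	}
	for key, flavor := range cfg.Flavors {
		// The examples repeat the map key in Name; flag a mismatch.
		if flavor.Name != "" && flavor.Name != key {
			return fmt.Errorf("flavor %q has mismatched Name %q", key, flavor.Name)
		}
		if flavor.CPUDefault < 0 {
			return fmt.Errorf("flavor %q has negative CPUDefault", key)
		}
		for _, flag := range flavor.SlurmFlags {
			if !strings.HasPrefix(flag, "-") {
				return fmt.Errorf("flavor %q has malformed SLURM flag %q", key, flag)
			}
		}
	}
	return nil
}
```

Running such a check at startup would surface a misspelled `DefaultFlavor` immediately rather than at the first job submission.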

### :wrench: Environment Variables list

20 changes: 20 additions & 0 deletions docker/SlurmConfig.yaml
@@ -20,3 +20,23 @@ ContainerRuntime: singularity
EnrootDefaultOptions: ["--rw"]
EnrootPrefix: ""
EnrootPath: enroot

# Flavor configuration - predefined sets of SLURM submission options
DefaultFlavor: "default"
Flavors:
  default:
    Name: "default"
    Description: "Standard CPU job (2 cores, 8GB RAM)"
    CPUDefault: 2
    MemoryDefault: "8G"
    SlurmFlags:
      - "--partition=normal"

  gpu-nvidia:
    Name: "gpu-nvidia"
    Description: "GPU job with NVIDIA GPU (4 cores, 32GB RAM, 1 GPU)"
    CPUDefault: 4
    MemoryDefault: "32G"
    SlurmFlags:
      - "--gres=gpu:1"
      - "--partition=gpu"
32 changes: 32 additions & 0 deletions examples/config/SlurmConfig.yaml
@@ -17,3 +17,35 @@ BashPath: /bin/bash
VerboseLogging: true
ErrorsOnlyLogging: false
EnableProbes: true

# Flavor configuration - predefined sets of SLURM submission options
DefaultFlavor: "default"
Flavors:
  default:
    Name: "default"
    Description: "Standard CPU job (4 cores, 16GB RAM)"
    CPUDefault: 4
    MemoryDefault: "16G"
    SlurmFlags:
      - "--partition=cpu"
      - "--time=01:00:00"

  gpu-nvidia:
    Name: "gpu-nvidia"
    Description: "GPU job with NVIDIA GPU (8 cores, 64GB RAM, 1 GPU)"
    CPUDefault: 8
    MemoryDefault: "64G"
    SlurmFlags:
      - "--gres=gpu:1"
      - "--partition=gpu"
      - "--time=04:00:00"

  high-io:
    Name: "high-io"
    Description: "High I/O job (16 cores, 32GB RAM, fast storage)"
    CPUDefault: 16
    MemoryDefault: "32G"
    SlurmFlags:
      - "--partition=fast-io"
      - "--time=02:00:00"
      - "--constraint=ssd"
52 changes: 35 additions & 17 deletions pkg/slurm/Create.go
@@ -57,15 +57,21 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {
 	metadata := data.Pod.ObjectMeta
 	filesPath := h.Config.DataRootFolder + data.Pod.Namespace + "-" + string(data.Pod.UID)

+	// Resolve flavor to apply default CPU and memory
+	flavor, err := resolveFlavor(spanCtx, h.Config, metadata, data.Pod.Spec.Containers)
+	if err != nil {
+		log.G(h.Ctx).Error("Failed to resolve flavor: ", err)
+		statusCode = http.StatusInternalServerError
+		h.handleError(spanCtx, w, statusCode, err)
+		return
+	}

 	var runtime_command_pod []ContainerCommand
 	var resourceLimits ResourceLimits

 	isDefaultCPU := true
 	isDefaultRam := true

-	maxCPULimit := 0
-	maxMemoryLimit := 0

 	cpuLimit := int64(0)
 	memoryLimit := int64(0)

@@ -79,28 +85,40 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {

 		cpuLimitFromContainer := int64(math.Ceil(cpuLimitFloat))

-		if cpuLimitFromContainer == 0 && isDefaultCPU {
-			log.G(h.Ctx).Warning(errors.New("Max CPU resource not set for " + container.Name + ". Only 1 CPU will be used"))
-			resourceLimits.CPU = 1
+		if cpuLimitFromContainer == 0 {
+			// No CPU limit specified in container, check if we should use flavor default
+			if isDefaultCPU && flavor != nil && flavor.CPUDefault > 0 {
+				log.G(h.Ctx).Infof("Max CPU resource not set for %s. Using flavor '%s' default: %d CPU", container.Name, flavor.FlavorName, flavor.CPUDefault)
+				cpuLimit = flavor.CPUDefault
+			} else if isDefaultCPU {
+				log.G(h.Ctx).Warning(errors.New("Max CPU resource not set for " + container.Name + ". Only 1 CPU will be used"))
+				cpuLimit = 1
+			}
 		} else {
-			if cpuLimitFromContainer > resourceLimits.CPU && maxCPULimit < int(cpuLimitFromContainer) {
+			// Container specified CPU limit
+			if cpuLimitFromContainer > cpuLimit {
 				log.G(h.Ctx).Info("Setting CPU limit to " + strconv.FormatInt(cpuLimitFromContainer, 10))
 				cpuLimit = cpuLimitFromContainer
-				maxCPULimit = int(cpuLimitFromContainer)
-				isDefaultCPU = false
 			}
+			isDefaultCPU = false
 		}

-		if memoryLimitFromContainer == 0 && isDefaultRam {
-			log.G(h.Ctx).Warning(errors.New("Max Memory resource not set for " + container.Name + ". Only 1MB will be used"))
-			resourceLimits.Memory = 1024 * 1024
+		if memoryLimitFromContainer == 0 {
+			// No memory limit specified in container, check if we should use flavor default
+			if isDefaultRam && flavor != nil && flavor.MemoryDefault > 0 {
+				log.G(h.Ctx).Infof("Max Memory resource not set for %s. Using flavor '%s' default: %d bytes", container.Name, flavor.FlavorName, flavor.MemoryDefault)
+				memoryLimit = flavor.MemoryDefault
+			} else if isDefaultRam {
+				log.G(h.Ctx).Warning(errors.New("Max Memory resource not set for " + container.Name + ". Only 1MB will be used"))
+				memoryLimit = 1024 * 1024
+			}
 		} else {
-			if memoryLimitFromContainer > resourceLimits.Memory && maxMemoryLimit < int(memoryLimitFromContainer) {
+			// Container specified memory limit
+			if memoryLimitFromContainer > memoryLimit {
 				log.G(h.Ctx).Info("Setting Memory limit to " + strconv.FormatInt(memoryLimitFromContainer, 10))
 				memoryLimit = memoryLimitFromContainer
-				maxMemoryLimit = int(memoryLimitFromContainer)
-				isDefaultRam = false
 			}
+			isDefaultRam = false
 		}

 		resourceLimits.CPU = cpuLimit
@@ -183,7 +201,7 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {

 	if data.JobScript == "" {
 		log.G(h.Ctx).Info("-- No custom job script provided, generating one...")
-		path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam)
+		path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam, flavor)
 		if err != nil {
 			log.G(h.Ctx).Error(err)
 			os.RemoveAll(filesPath)
@@ -229,7 +247,7 @@ func (h *SidecarHandler) SubmitHandler(w http.ResponseWriter, r *http.Request) {
 			containerImage: "n/a",
 		})

-		path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam)
+		path, err = produceSLURMScript(spanCtx, h.Config, data.Pod, filesPath, metadata, runtime_command_pod, resourceLimits, isDefaultCPU, isDefaultRam, flavor)
 		if err != nil {
 			log.G(h.Ctx).Error(err)
 			os.RemoveAll(filesPath)