
CUDA OOM in the middle of training with dask + xgboost #11684

@jamesram415

Description


I'm fitting a model with Dask and XGBoost, and I've noticed failures at random points in training where XGBoost tries to allocate 64GB of device memory.

This seems to happen at a random round in the middle of training. I'm fitting for 100 boosting rounds and have seen it fail at round 5, at round 25, and even later in training.

A profile of GPU memory usage (sampled roughly as in the snippet below) shows about 90GB used per device, holding stable, until XGBoost hits a CUDA OOM error when it attempts to allocate 64GB of memory at some boosting round.

I'm fitting on 4 H200s with the latest xgboost and dask.
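
For reference, the per-device numbers above come from periodically sampling the workers with `client.run`, roughly like the sketch below (this assumes pynvml is available on the workers and that dask-cuda has pinned each worker to a single GPU via CUDA_VISIBLE_DEVICES):

```python
import os

import pynvml


def gpu_memory_report():
    """Return (used_GB, free_GB) for the GPU this worker is pinned to."""
    pynvml.nvmlInit()
    # dask-cuda puts the worker's own device first in CUDA_VISIBLE_DEVICES;
    # NVML indexes physical devices, so that integer is the one to query.
    device = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
    handle = pynvml.nvmlDeviceGetHandleByIndex(device)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    pynvml.nvmlShutdown()
    return info.used / 1e9, info.free / 1e9


# Run on every worker, e.g. between training calls:
# client.run(gpu_memory_report)
```

The ~90GB per device figure quoted above is the steady-state `used` value from this kind of check on each of the four devices.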

My code is roughly as follows:

```python
import numpy as np
import dask.array as da
from dask import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

cluster = LocalCUDACluster(
    n_workers=4, threads_per_worker=16, memory_limit=0.9, device_memory_limit=0.9
)
client = Client(cluster)

def load_data():
    ...  # discovery of all_batches / column lists elided
    # Each future resolves to a (features, labels, weights) tuple for one batch of row groups.
    futures = [
        client.submit(
            load_from_row_groups,
            batch_indices,
            columns,
            pure=False,
        )
        for batch_indices in all_batches
    ]

    # Split each tuple future into separate feature / label / weight futures.
    f_x = [client.submit(lambda t: t[0], f) for f in futures]
    f_y = [client.submit(lambda t: t[1], f) for f in futures]
    f_w = [client.submit(lambda t: t[2], f) for f in futures]

    # Wrap each future as a dask array chunk with an unknown number of rows.
    x_parts = [da.from_delayed(delayed(x), shape=(np.nan, len(x_cols)), dtype=np.float32) for x in f_x]
    y_parts = [da.from_delayed(delayed(y), shape=(np.nan, len(y_cols)), dtype=np.float32) for y in f_y]
    w_parts = [da.from_delayed(delayed(w), shape=(np.nan, len(w_cols)), dtype=np.float32) for w in f_w]

    x = da.concatenate(x_parts).persist()
    y = da.concatenate(y_parts).persist()
    w = da.concatenate(w_parts).persist()
    return x, y, w

x, y, w = load_data()

dtrain = dxgb.DaskQuantileDMatrix(
    client,
    x,
    y,
    weight=w,
    max_bin=model_config.max_bin,
    ref=None,
)

dxgb.train(
    client,
    booster_config,
    dtrain,
)
```

This is the error I ultimately get:

```
2025-09-12 11:17:11.850Z ERROR    distributed.worker - Compute Failed
Key:       fn-a5e41b0d-cd3c-4cf9-b7bf-f497901989e9
State:     executing
Task:  <Task 'fn-a5e41b0d-cd3c-4cf9-b7bf-f497901989e9' fn(...)>
Exception: "XGBoostError('[11:17:11] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 53.1098GB\\n- Requested memory: 64GB\\n\\nStack trace:\\n  [bt] (0) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7f62e6ea6e7c]\\n  [bt] (1) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xa5c493) [0x7f62e765c493]\\n  [bt] (2) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd8c1e) [0x7f62e7bd8c1e]\\n  [bt] (3) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd9377) [0x7f62e7bd9377]\\n  [bt] (4) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd982c) [0x7f62e7bd982c]\\n  [bt] (5) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd9fe0) [0x7f62e7bd9fe0]\\n  [bt] (6) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfdd29a) [0x7f62e7bdd29a]\\n  [bt] (7) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfdf44c) [0x7f62e7bdf44c]\\n  [bt] (8) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0x63b5b2) [0x7f62e723b5b2]\\n\\n')"
Traceback: '  File "/venv/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 554, in fn\n    return [func(*args, **kwargs)]\n           

 xgboost.core.XGBoostError: [11:17:11] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 53.1098GB
- Requested memory: 64GB
```
