
CUDA OOM in the middle of training with dask + xgboost #11684

@jamesram415

Description


I'm fitting a model with Dask and XGBoost, and I've noticed failures at random points in training where XGBoost tries to allocate 64GB of device memory.

This seems to happen at a random round in the middle of training. I'm fitting for 100 boosting rounds and have seen it fail at round 5, at round 25, and even later in training.

A profile of GPU memory usage (sampled roughly as in the snippet below) shows about 90GB used per device, holding stable, until XGBoost hits a CUDA OOM error when it attempts to allocate 64GB of memory at some boosting round.

I'm fitting on 4 H200s with the latest xgboost and dask.
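
For reference, the per-device numbers above come from periodically sampling the workers with `client.run`, roughly like the sketch below (this assumes pynvml is available on the workers and that dask-cuda has pinned each worker to a single GPU via CUDA_VISIBLE_DEVICES):

```python
import os

import pynvml


def gpu_memory_report():
    """Return (used_GB, free_GB) for the GPU this worker is pinned to."""
    pynvml.nvmlInit()
    # dask-cuda puts the worker's own device first in CUDA_VISIBLE_DEVICES;
    # NVML indexes physical devices, so that integer is the one to query.
    device = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
    handle = pynvml.nvmlDeviceGetHandleByIndex(device)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    pynvml.nvmlShutdown()
    return info.used / 1e9, info.free / 1e9


# Run on every worker, e.g. between training calls:
# client.run(gpu_memory_report)
```

The ~90GB per device figure quoted above is the steady-state `used` value from this kind of check on each of the four devices.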

My code is roughly as follows:

```python
import numpy as np
import dask.array as da
from dask import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb

cluster = LocalCUDACluster(
    n_workers=4, threads_per_worker=16, memory_limit=0.9, device_memory_limit=0.9
)
client = Client(cluster)

def load_data():
    ...  # discovery of all_batches / column lists elided
    # Each future resolves to a (features, labels, weights) tuple for one batch of row groups.
    futures = [
        client.submit(
            load_from_row_groups,
            batch_indices,
            columns,
            pure=False,
        )
        for batch_indices in all_batches
    ]

    # Split each tuple future into separate feature / label / weight futures.
    f_x = [client.submit(lambda t: t[0], f) for f in futures]
    f_y = [client.submit(lambda t: t[1], f) for f in futures]
    f_w = [client.submit(lambda t: t[2], f) for f in futures]

    # Wrap each future as a dask array chunk with an unknown number of rows.
    x_parts = [da.from_delayed(delayed(x), shape=(np.nan, len(x_cols)), dtype=np.float32) for x in f_x]
    y_parts = [da.from_delayed(delayed(y), shape=(np.nan, len(y_cols)), dtype=np.float32) for y in f_y]
    w_parts = [da.from_delayed(delayed(w), shape=(np.nan, len(w_cols)), dtype=np.float32) for w in f_w]

    x = da.concatenate(x_parts).persist()
    y = da.concatenate(y_parts).persist()
    w = da.concatenate(w_parts).persist()
    return x, y, w

x, y, w = load_data()

dtrain = dxgb.DaskQuantileDMatrix(
    client,
    x,
    y,
    weight=w,
    max_bin=model_config.max_bin,
    ref=None,
)

dxgb.train(
    client,
    booster_config,
    dtrain,
)
```

This is the error I ultimately get:

```
2025-09-12 11:17:11.850Z ERROR    distributed.worker - Compute Failed
Key:       fn-a5e41b0d-cd3c-4cf9-b7bf-f497901989e9
State:     executing
Task:  <Task 'fn-a5e41b0d-cd3c-4cf9-b7bf-f497901989e9' fn(...)>
Exception: "XGBoostError('[11:17:11] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 53.1098GB\\n- Requested memory: 64GB\\n\\nStack trace:\\n  [bt] (0) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7f62e6ea6e7c]\\n  [bt] (1) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xa5c493) [0x7f62e765c493]\\n  [bt] (2) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd8c1e) [0x7f62e7bd8c1e]\\n  [bt] (3) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd9377) [0x7f62e7bd9377]\\n  [bt] (4) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd982c) [0x7f62e7bd982c]\\n  [bt] (5) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd9fe0) [0x7f62e7bd9fe0]\\n  [bt] (6) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfdd29a) [0x7f62e7bdd29a]\\n  [bt] (7) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfdf44c) [0x7f62e7bdf44c]\\n  [bt] (8) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0x63b5b2) [0x7f62e723b5b2]\\n\\n')"
Traceback: '  File "/venv/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 554, in fn\n    return [func(*args, **kwargs)]\n           

 xgboost.core.XGBoostError: [11:17:11] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 53.1098GB
- Requested memory: 64GB
```
