I'm fitting with Dask & XGBoost and I've noticed failures at random points in training, where XGBoost tries to allocate 64GB and hits a CUDA OOM.
This happens at a seemingly random boosting round in the middle of training. I'm fitting for 100 boosting rounds and I've seen it happen at round 5, at round 25, and even later.
A GPU memory profile shows usage of roughly 90GB per device, stable, right up until XGBoost attempts the 64GB allocation at some boosting round and fails with the CUDA OOM error.
I'm fitting on 4 H200s with the latest xgboost and dask.
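(For context, the per-device figure above comes from watching a memory profile; something like the pynvml loop below, run on the host, approximates that kind of sampling. It's only an illustrative sketch, not the exact tooling behind the numbers, and the 5-second interval is arbitrary.)

```python
# Illustrative only: sample used/free memory on every visible GPU with pynvml.
# Not the exact profiler behind the ~90GB/device figure; interval is arbitrary.
import time

import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    for i, h in enumerate(handles):
        info = pynvml.nvmlDeviceGetMemoryInfo(h)
        print(f"GPU {i}: used={info.used / 2**30:.1f}GB free={info.free / 2**30:.1f}GB")
    time.sleep(5)
```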
My code is roughly as follows:
```python
from dask import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from xgboost import dask as dxgb
import dask.array as da
import numpy as np

cluster = LocalCUDACluster(
    n_workers=4, threads_per_worker=16, memory_limit=0.9, device_memory_limit=0.9
)
client = Client(cluster)


def load_data():
    # ... build `all_batches`, `columns`, `x_cols`/`y_cols`/`w_cols` and the
    # `load_from_row_groups` loader (omitted).
    futures = [
        client.submit(
            load_from_row_groups,
            batch_indices,
            columns,
            pure=False,
        )
        for batch_indices in all_batches
    ]
    # Split each (X, y, w) tuple future into per-component futures.
    f_x = [client.submit(lambda t: t[0], f) for f in futures]
    f_y = [client.submit(lambda t: t[1], f) for f in futures]
    f_w = [client.submit(lambda t: t[2], f) for f in futures]
    # Wrap them as dask arrays with unknown chunk lengths, concatenate, and persist.
    x_parts = [da.from_delayed(delayed(x), shape=(np.nan, len(x_cols)), dtype=np.float32) for x in f_x]
    y_parts = [da.from_delayed(delayed(y), shape=(np.nan, len(y_cols)), dtype=np.float32) for y in f_y]
    w_parts = [da.from_delayed(delayed(w), shape=(np.nan, len(w_cols)), dtype=np.float32) for w in f_w]
    x = da.concatenate(x_parts).persist()
    y = da.concatenate(y_parts).persist()
    w = da.concatenate(w_parts).persist()
    return x, y, w


x, y, w = load_data()

dtrain = dxgb.DaskQuantileDMatrix(
    client,
    x,
    y,
    weight=w,
    max_bin=self.model_config.max_bin,
    ref=None,
)
dxgb.train(
    client,
    booster_config,
    dtrain,
)
```
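To narrow down which boosting round the spike happens at, something like the callback below could log free device memory after every round. This is only a sketch: it assumes pynvml is importable on each worker and that dask-cuda's default integer-index CUDA_VISIBLE_DEVICES is in place; it's not part of the failing run above.

```python
import os

import pynvml
import xgboost as xgb


class GPUMemoryLogger(xgb.callback.TrainingCallback):
    """Print free/used device memory after each boosting round (runs on the workers)."""

    def after_iteration(self, model, epoch, evals_log):
        pynvml.nvmlInit()
        # dask-cuda pins each worker to one GPU via CUDA_VISIBLE_DEVICES, but NVML
        # indexes physical devices, so map through the first visible index
        # (assumes the default integer form of CUDA_VISIBLE_DEVICES).
        dev = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
        info = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(dev))
        print(f"round {epoch}: free={info.free / 2**30:.1f}GB used={info.used / 2**30:.1f}GB")
        return False  # False -> keep training


dxgb.train(
    client,
    booster_config,
    dtrain,
    num_boost_round=100,
    callbacks=[GPUMemoryLogger()],
)
```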
This is the error I ultimately get:
```
2025-09-12 11:17:11.850Z ERROR distributed.worker - Compute Failed
Key: fn-a5e41b0d-cd3c-4cf9-b7bf-f497901989e9
State: executing
Task: <Task 'fn-a5e41b0d-cd3c-4cf9-b7bf-f497901989e9' fn(...)>
Exception: "XGBoostError('[11:17:11] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory\\n- Free memory: 53.1098GB\\n- Requested memory: 64GB\\n\\nStack trace:\\n [bt] (0) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0x2a6e7c) [0x7f62e6ea6e7c]\\n [bt] (1) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xa5c493) [0x7f62e765c493]\\n [bt] (2) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd8c1e) [0x7f62e7bd8c1e]\\n [bt] (3) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd9377) [0x7f62e7bd9377]\\n [bt] (4) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd982c) [0x7f62e7bd982c]\\n [bt] (5) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfd9fe0) [0x7f62e7bd9fe0]\\n [bt] (6) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfdd29a) [0x7f62e7bdd29a]\\n [bt] (7) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0xfdf44c) [0x7f62e7bdf44c]\\n [bt] (8) /venv/lib/python3.12/site-packages/xgboost/lib/libxgboost.so(+0x63b5b2) [0x7f62e723b5b2]\\n\\n')"
Traceback: ' File "/venv/lib/python3.12/site-packages/xgboost/dask/__init__.py", line 554, in fn\n return [func(*args, **kwargs)]\n
xgboost.core.XGBoostError: [11:17:11] /workspace/src/common/device_vector.cu:23: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 53.1098GB
- Requested memory: 64GB
```