Description
Summary
I'm seeing persistent CUDA OOMs while fine-tuning on an H100 80GB GPU.
Symptoms:
- OOM occurs even after lowering the batch size (per-device batch = 1).
- I initially discovered the DeepSpeed config (zero2.json) was missing, so I added the zero2.json below (ZeRO stage 2 with CPU optimizer offload), but the OOMs persisted.
- Training runs for a while, GPU memory usage steadily increases, and after ~10% progress a CUDA OOM occurs.
- I instrumented the run and found that the optimizer state is sometimes created mid-run, causing a large allocation spike. I forced a pre-init dummy forward pass and a tiny training step to materialize the optimizer state up front, but memory still climbs over time (see the instrumentation sketch below).
I need help diagnosing whether this is a bug or misconfiguration in my script / Trainer / DeepSpeed usage, or whether I need to reconfigure something else.
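For reference, the instrumentation was essentially a memory-logging callback along the lines of the sketch below; the class name and logging interval are mine, while the TrainerCallback hooks and the torch.cuda memory queries are standard API.

import torch
from transformers import TrainerCallback

class CudaMemoryLogger(TrainerCallback):
    """Print allocated/reserved/peak CUDA memory every 50 optimizer steps."""

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available() and state.global_step % 50 == 0:
            allocated = torch.cuda.memory_allocated() / 2**30      # GiB currently allocated by tensors
            reserved = torch.cuda.memory_reserved() / 2**30        # GiB held by the caching allocator
            peak = torch.cuda.max_memory_allocated() / 2**30       # GiB peak allocation so far
            print(f"step {state.global_step}: "
                  f"allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")

# Registered on the Trainer, e.g.: Trainer(..., callbacks=[CudaMemoryLogger()])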
Environment
- OS: Ubuntu 22.04
- GPU: NVIDIA H100 (80 GB)
- CUDA: 12.2
- Python: 3.9
zero2.json
{
  "train_micro_batch_size_per_gpu": 1,
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 4e-5,
      "betas": [0.9, 0.95],
      "eps": 1e-7,
      "weight_decay": 0.01
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 4e-5,
      "warmup_num_steps": 50
    }
  },
  "fp16": {
    "enabled": false
  },
  "bf16": {
    "enabled": true
  },
  "zero_allow_untested_optimizer": true,
  "steps_per_print": 200
}
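For completeness, a minimal sketch of how a config like this is passed to the HF Trainer; the model name, toy dataset, and output path here are placeholders rather than my actual setup, and only the DeepSpeed-related arguments matter for this issue.

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# "gpt2" and the toy dataset are stand-ins for the real model/data.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tokenizer("hello world", return_tensors="pt")
train_dataset = [{"input_ids": enc["input_ids"][0], "labels": enc["input_ids"][0]}] * 8

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,   # matches train_micro_batch_size_per_gpu in zero2.json
    gradient_accumulation_steps=1,   # matches gradient_accumulation_steps in zero2.json
    learning_rate=4e-5,              # matches the optimizer lr in zero2.json
    bf16=True,                       # matches bf16.enabled in zero2.json (fp16 stays off)
    deepspeed="zero2.json",          # hands the ZeRO-2 + CPU offload config to the HF integration
)

Trainer(model=model, args=training_args, train_dataset=train_dataset).train()

The script is started through a distributed launcher (e.g. the deepspeed or torchrun CLI) so the DeepSpeed integration can initialize its backend.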
Error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.59 GiB. GPU 0 has a total capacity of 79.11 GiB of which 20.15 GiB is free. Including non-PyTorch memory, this process has 58.95 GiB memory in use. Of the allocated memory 55.45 GiB is allocated by PyTorch, and 2.75 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The following warning is printed repeatedly during the run:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Progress at the time of the OOM:
10%|██████▏ | 554/5670 [48:24<7:26:58, 5.24s/it]
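The two environment variables suggested in the log above can be set at the top of the entry script, before anything initializes CUDA or uses the tokenizer. A minimal sketch (assuming a single Python entry point); I have not confirmed whether this changes the underlying memory growth:

import os

# Must be set before torch initializes CUDA for the allocator option to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# Silences the huggingface/tokenizers fork warning when DataLoader workers fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch  # imported after the allocator config so the setting is picked up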