torch.OutOfMemoryError: CUDA out of memory during training (H100 80GB) — memory steadily increases despite reduced batch size and ZeRO2 offload #1

Summary

I'm seeing persistent CUDA OOMs while fine-tuning on an H100 80GB GPU.
Symptoms:

  • OOM occurs even after lowering the batch size (per-device batch size = 1).
  • I initially discovered that the DeepSpeed config (zero2.json) was missing; I added one (ZeRO stage 2 with CPU optimizer offload), but the OOMs persisted.
  • Training runs for a while and GPU memory usage climbs steadily; after roughly 10% progress the CUDA OOM hits.
  • I instrumented the run and found that the optimizer state is sometimes created mid-run, causing a large allocation spike. I forced a pre-init dummy forward and a tiny training step to materialize the optimizer state up-front (see the sketch below), but memory still climbs over time.

I need help diagnosing whether this is a bug or misconfiguration in my script / Trainer / DeepSpeed usage, or whether I need to reconfigure something else.
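For context, the pre-init workaround looks roughly like this (simplified sketch, not my exact script; the model, tokenizer, and optimizer names are placeholders, and this ignores the DeepSpeed engine wrapper):

import torch

# Rough sketch of the pre-init workaround (placeholder names, not the exact script).
# Idea: run one dummy forward + backward + optimizer.step() before the real loop,
# so the AdamW state buffers are allocated up-front instead of mid-run.
def warm_up(model, tokenizer, optimizer, device="cuda"):
    model.train()
    batch = tokenizer("warm-up text", return_tensors="pt").to(device)
    batch["labels"] = batch["input_ids"].clone()

    out = model(**batch)                     # dummy forward
    out.loss.backward()                      # dummy backward so gradients exist
    optimizer.step()                         # materializes exp_avg / exp_avg_sq
    optimizer.zero_grad(set_to_none=True)

    torch.cuda.synchronize()
    print(f"allocated after warm-up: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")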


Environment

  • OS: Ubuntu 22.04
  • GPU: NVIDIA H100 (80 GB)
  • CUDA: 12.2
  • Python: 3.9

zero2.json

{
  "train_micro_batch_size_per_gpu": 1,
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,

  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "overlap_comm": true,
    "contiguous_gradients": true
  },

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 4e-5,
      "betas": [0.9, 0.95],
      "eps": 1e-7,
      "weight_decay": 0.01
    }
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 4e-5,
      "warmup_num_steps": 50
    }
  },

  "fp16": {
    "enabled": false
  },
  "bf16": {
    "enabled": true
  },

  "zero_allow_untested_optimizer": true,
  "steps_per_print": 200
}
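For completeness, the config is wired into the HF Trainer roughly like this (simplified sketch; the model and dataset setup are omitted, and the values shown mirror the JSON above and need to agree with the DeepSpeed config, or be set to "auto" there):

from transformers import Trainer, TrainingArguments

# Sketch only: model and train_dataset are placeholders for my actual setup.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # must match train_micro_batch_size_per_gpu
    gradient_accumulation_steps=1,
    learning_rate=4e-5,
    bf16=True,                        # matches "bf16": {"enabled": true}
    deepspeed="zero2.json",           # path to the config above
)

trainer = Trainer(
    model=model,                      # placeholder: the model being fine-tuned
    args=args,
    train_dataset=train_dataset,      # placeholder: tokenized training set
)
trainer.train()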

Error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.59 GiB. GPU 0 has a total capacity of 79.11 GiB of which 20.15 GiB is free. Including non-PyTorch memory, this process has 58.95 GiB memory in use. Of the allocated memory 55.45 GiB is allocated by PyTorch, and 2.75 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(the tokenizers warning above was printed four times)

 10%|██████▏ | 554/5670 [48:24<7:26:58, 5.24s/it]
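Based on the allocator hint and the tokenizers warning in that log, the environment tweaks plus a per-step memory probe would look roughly like this (sketch; MemoryProbe is my own diagnostic helper, not part of transformers):

import os

# Set before importing torch / before dataloader workers fork.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # allocator hint from the OOM message
os.environ["TOKENIZERS_PARALLELISM"] = "false"                      # silences the fork warnings

import torch
from transformers import TrainerCallback

class MemoryProbe(TrainerCallback):
    """Logs allocated/reserved CUDA memory every N steps to see where the climb happens."""
    def __init__(self, every=50):
        self.every = every

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every == 0:
            alloc = torch.cuda.memory_allocated() / 2**30
            reserved = torch.cuda.memory_reserved() / 2**30
            print(f"step {state.global_step}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

# usage: Trainer(..., callbacks=[MemoryProbe(every=50)])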
