
[BUG] CUDA OutOfMemoryError during backward pass in distributed training #284

@Bat-Reality

Bug Description

During distributed RFT training on 8 L20 GPUs, the actor update step intermittently fails with torch.OutOfMemoryError in the backward pass, even after reducing the batch size and response length (details below).

Environment Information

  • Python Version: 3.12.4
  • GPU: NVIDIA L20-40G * 8
  • CUDA Version: 12.4
  • Installation Method: git clone
  • Trinity-RFT Version: 0.3.0.dev0

Steps to Reproduce


  1. trinity run --config examples/XXX/XXX.yaml

Actual Behavior

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.76 GiB. 
GPU 0 has a total capacity of 44.52 GiB of which only 23 GiB is free. 
This process has ~41 GiB memory in use. Of the allocated memory, ~36 GiB is used by PyTorch.

ray.exceptions.RayTaskError(OutOfMemoryError): 
WorkerDict.actor_update_actor() failed with CUDA OOM

To reiterate: I am running on 8 L20 GPUs with the batch size reduced to 16, max_response_tokens set to 2048, and repeat_times set to 8, yet the error still occurs at random points during training. The relevant overrides are sketched below.
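
For reference, a rough sketch of the settings I changed relative to the example config (the key names are the ones I refer to above; the exact section nesting follows the example YAML, which I have otherwise left unchanged):

```yaml
# Sketch of my overrides; nesting abbreviated, everything else as in examples/XXX/XXX.yaml
batch_size: 16            # reduced from the example value
max_response_tokens: 2048 # cap on generated response length
repeat_times: 8           # rollouts per prompt
```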

Log Information


[Screenshots of the full error traceback attached as images.]
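
The screenshots capture the traceback only. If a fuller allocator breakdown would help with triage, I can re-run and dump PyTorch's memory summary around the failing actor-update step; a minimal sketch of what I would add (the helper and where I would call it are my own, not Trinity-RFT API):

```python
import torch

def dump_gpu_memory(tag: str) -> None:
    """Print PyTorch's allocator summary for every visible GPU."""
    for device in range(torch.cuda.device_count()):
        print(f"[{tag}] memory summary for cuda:{device}")
        print(torch.cuda.memory_summary(device=device, abbreviated=True))

# I would call this immediately before and after the step that OOMs, e.g.
# dump_gpu_memory("before actor update"), and attach the output here.
```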
