Bug Description
Please provide a detailed description of the issue you encountered.
Environment Information
- Python Version: 3.12.4
- GPU: NVIDIA L20-40G * 8
- CUDA Version: 12.4
- Installation Method: git clone
- Trinity-RFT Version: 0.3.0.dev0
Steps to Reproduce
Please provide a minimal, self-contained, and reproducible example.
- trinity run --config examples/XXX/XXX.yaml
Actual Behavior
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.76 GiB.
GPU 0 has a total capacity of 44.52 GiB, of which only 2–3 GiB is free.
This process has ~41 GiB of memory in use; of the allocated memory, ~36 GiB is used by PyTorch.
ray.exceptions.RayTaskError(OutOfMemoryError):
WorkerDict.actor_update_actor() failed with CUDA OOM
To reiterate: I am running on 8 L20 GPUs with the batch size reduced to 16, max_response_tokens set to 2048, and repeat_times set to 8. Even with these reduced settings, the OOM error still occurs intermittently. The relevant settings are sketched below.
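For reference, here is a minimal YAML sketch of the settings described above. The exact key names and nesting (`buffer.batch_size`, `model.max_response_tokens`, `algorithm.repeat_times`, `cluster.gpu_per_node`) are assumptions based on the parameter names mentioned in this report, not a copy of the actual config file, so treat it as illustrative only.

```yaml
# Illustrative fragment only: key names/nesting are assumed, not taken from the real config.
cluster:
  gpu_per_node: 8            # 8x NVIDIA L20 (~44.5 GiB each)
buffer:
  batch_size: 16             # already reduced to lower memory pressure
model:
  max_response_tokens: 2048  # cap on generated tokens per response
algorithm:
  repeat_times: 8            # rollouts generated per prompt
```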
Log Information
If applicable, include any relevant log output here.



