Hi!
I've been loosely following the recent conversations on bugs/issues such as #34198 and #34191. As a lay user, it's not entirely clear to me what the underlying issue is.
To home in on specific questions: as someone who wants to use Trainer with a custom loss function, I'm concerned there are several factors I need to account for, which leads to the following questions:
- When providing a custom `compute_loss_func`, is the expectation that I divide the loss by `num_items_in_batch` (batch size × gradient accumulation steps)? To confirm my understanding: gradient accumulation is handled by simply breaking each "step" of the effective batch into smaller micro-steps, so the outputs and labels passed to the function already account for gradient accumulation, and the loss only needs to be divided by `num_items_in_batch`. (A minimal sketch of what I mean is after this list.)
- I'm seeing open issues, such as #38837 ("Loss is incorrectly scaled in Trainer during the last step with gradient accumulation when the final batch is smaller than accumulation steps"), which appear to be specific to the last step when the batches don't divide evenly across accumulation steps. Is this still an issue?
- Do any of these dynamics change in a multi-GPU setup?
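
To make the first question concrete, here is a minimal sketch of what I mean by "dividing by `num_items_in_batch`", assuming the `(outputs, labels, num_items_in_batch)` signature discussed in the issues above; the name `my_compute_loss` is just for illustration, and I may well be misunderstanding what the Trainer expects:

```python
import torch
import torch.nn.functional as F

def my_compute_loss(outputs, labels, num_items_in_batch=None):
    """Token-level cross-entropy, summed then divided by num_items_in_batch.

    My current understanding: sum the loss over this micro-batch and divide
    by the item count across all gradient accumulation steps, so that the
    accumulated gradients match a single large-batch step.
    """
    logits = outputs.logits
    # Shift so tokens predict the next token, as in causal LM training.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    if num_items_in_batch is not None:
        loss = loss / num_items_in_batch
    else:
        # Fall back to a plain per-token mean if no count is passed.
        loss = loss / (shift_labels != -100).sum()
    return loss

# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   compute_loss_func=my_compute_loss)
```

Is this roughly the intended usage, or is the division already handled for me somewhere inside the Trainer?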