Hi!
I've been loosely following the recent conversations on bugs/issues such as #34198 and #34191. As a lay user, it's not entirely clear to me what the underlying issue is.
To home in on specific questions: as someone who wants to use Trainer with a custom loss function, I'm concerned there are several factors I need to account for, which leads to the following questions:
- When providing a custom `compute_loss_func`, is the expectation that I divide the loss by `num_items_in_batch` (batch size × gradient accumulation steps)? To confirm my understanding: gradient accumulation is handled by simply breaking each "step" of the effective batch into smaller micro-steps, so the outputs and labels passed to the function already account for gradient accumulation, and the loss only needs to be divided by `num_items_in_batch`. (A minimal sketch of what I mean is after this list.)
- I'm seeing open issues, such as #38837 ("Loss is incorrectly scaled in Trainer during the last step with gradient accumulation when the final batch is smaller than accumulation steps"), which appear to be specific to the last step when the batches don't divide evenly across accumulation steps. Is this still an issue?
- Do any of these dynamics change in a multi-GPU setup?
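
To make the first question concrete, here is a minimal sketch of what I mean by "dividing by `num_items_in_batch`", assuming the `(outputs, labels, num_items_in_batch)` signature discussed in the issues above; the name `my_compute_loss` is just for illustration, and I may well be misunderstanding what the Trainer expects:

```python
import torch
import torch.nn.functional as F

def my_compute_loss(outputs, labels, num_items_in_batch=None):
    """Token-level cross-entropy, summed then divided by num_items_in_batch.

    My current understanding: sum the loss over this micro-batch and divide
    by the item count across all gradient accumulation steps, so that the
    accumulated gradients match a single large-batch step.
    """
    logits = outputs.logits
    # Shift so tokens predict the next token, as in causal LM training.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    if num_items_in_batch is not None:
        loss = loss / num_items_in_batch
    else:
        # Fall back to a plain per-token mean if no count is passed.
        loss = loss / (shift_labels != -100).sum()
    return loss

# trainer = Trainer(model=model, args=args, train_dataset=train_ds,
#                   compute_loss_func=my_compute_loss)
```

Is this roughly the intended usage, or is the division already handled for me somewhere inside the Trainer?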