
Clarification on Recent Changes to Loss and Gradient Accumulation #39567

@jiosephlee

Description

Hi!

I've been loosely following the recent conversations around bugs/issues such as #34198 and #34191. As a lay user, I'm not entirely clear on what the underlying issue is.

To focus on specific questions: as someone who wants to use Trainer with a custom loss function, I'm concerned there are several factors I need to account for, which leads me to the following questions:

  1. When providing a custom compute_loss_func, is the expectation that I divide the loss by num_items_in_batch (roughly batch_size * gradient_accumulation_steps)? To confirm my understanding: gradient accumulation works by splitting each "effective batch" into smaller micro-steps, so the outputs and labels passed to the loss function already reflect that splitting, and the loss only needs to be divided by num_items_in_batch (a sketch of what I have in mind follows this list).
  2. I'm seeing open issues such as #38837 ("Loss is incorrectly scaled in Trainer during the last step with gradient accumulation when the final batch is smaller than accumulation steps"), which appear to be specific to the last accumulation step when the number of batches isn't evenly divisible by the accumulation steps. Is this still an issue?
  3. Do any of these dynamics change in a multi-GPU setup?
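For concreteness, here is a minimal sketch of the kind of custom loss I'd like to plug in. I'm assuming compute_loss_func is called as compute_loss_func(outputs, labels, num_items_in_batch=...), which is what I gathered from recent Trainer versions; please correct me if the signature or the intended scaling is different.

```python
import torch
import torch.nn.functional as F

def my_compute_loss(outputs, labels, num_items_in_batch=None):
    """Sketch of a custom causal-LM loss; not claiming this is the official recipe."""
    logits = outputs.logits

    # Shift so that token t predicts token t+1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Sum (rather than mean) over tokens, so the normalization is done explicitly below.
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )

    if num_items_in_batch is not None:
        # My assumption: divide by the total item count across all accumulation micro-steps.
        loss = loss / num_items_in_batch
    else:
        # Fallback: plain per-batch mean over non-ignored tokens.
        loss = loss / (shift_labels != -100).sum().clamp(min=1)
    return loss

# trainer = Trainer(..., compute_loss_func=my_compute_loss)
```

Is this the intended usage, or does Trainer already apply some of this scaling internally?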
