Vanishing Jacobian problem #343
Replies: 2 comments 1 reply
-
Hi! Thanks for your interest in TorchJD! It seems that the training of your first loss is quite unstable, making the model diverge (or at least the part of the model leading to your first loss), which results in NaN values in the gradient. If this comes from an exploding gradient issue, you could try to down-weight the first task in the aggregator:
```python
...
aggregator = UPGrad(pref_vector=torch.tensor([0.1, 1., 1., 1., 1., 1.]))
...
```

or scale down the first loss directly:

```python
...
torchjd.mtl_backward([loss1 / 10, loss2, loss3, loss4, loss5, loss6], features=features, aggregator=aggregator)
...
```
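Put together, a self-contained toy version of that suggestion could look like the following (the small linear model, random data, and dummy losses are placeholders just to make the snippet runnable, not your actual setup):

```python
import torch
import torchjd
from torchjd.aggregation import UPGrad

# Placeholder shared trunk and per-task heads, only to make the example runnable.
shared = torch.nn.Linear(8, 8)
heads = torch.nn.ModuleList([torch.nn.Linear(8, 1) for _ in range(6)])

x = torch.randn(16, 8)
features = shared(x)
losses = [head(features).pow(2).mean() for head in heads]

# Give the first task a 10x smaller weight in the aggregation so its
# (possibly exploding) gradient contributes less to the shared update.
aggregator = UPGrad(pref_vector=torch.tensor([0.1, 1., 1., 1., 1., 1.]))

torchjd.mtl_backward(losses, features=features, aggregator=aggregator)
```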
I'm not sure why the other rows of the Jacobian are zero. This could be due to a vanishing gradient problem. Is this always the case, even at the very first iteration of training? If so, there could be an issue in the way you use the ... A possible way to debug would also be to use the ... I hope this will help!
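Independently of that, one generic way to locate where NaN or all-zero rows first appear (plain PyTorch, not a TorchJD-specific tool) is to enable anomaly detection and inspect each row of the Jacobian by hand. A minimal sketch, assuming `loss1` ... `loss6` and the `features` tensor have just been computed and the graph is still alive (i.e., run this before the backward call):

```python
import torch

# Make autograd raise an error at the exact backward op that produces NaN/Inf.
torch.autograd.set_detect_anomaly(True)

# Each gradient below is one row of the Jacobian w.r.t. the shared features:
# an all-zero row suggests a vanishing gradient for that task, a NaN row
# suggests that task's branch has diverged.
for i, loss in enumerate([loss1, loss2, loss3, loss4, loss5, loss6]):
    (grad,) = torch.autograd.grad(loss, features, retain_graph=True)
    print(f"loss{i + 1}: norm={grad.norm().item():.3e}, "
          f"has_nan={torch.isnan(grad).any().item()}")
```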
-
Hi, thanks for the reply. I already have some layer norms within my model's "shared_features" part. I'll try experimenting with the pref_vector. I currently have the pref_vector slightly emphasizing the first task, as I want it to be optimized more heavily.
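Roughly along these lines (the numbers here are only illustrative, not my exact values):

```python
import torch
from torchjd.aggregation import UPGrad

# Illustrative preference vector giving the first task a larger weight;
# the exact values in my runs are different.
aggregator = UPGrad(pref_vector=torch.tensor([2., 1., 1., 1., 1., 1.]))
```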
The error usually appears deep into the training process (around the 9th or 10th training epoch). Something else to note is that I am using a CosineAnnealingWarmRestarts scheduler, and the error usually occurs when the learning rate "resets" from something small like 10^-7 back to 10^-3. I'll try experimenting some more.
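For reference, the scheduler is set up roughly like this (the model, optimizer, and T_0 here are placeholders, not my actual configuration):

```python
import torch

model = torch.nn.Linear(16, 4)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Cosine annealing with warm restarts: the LR decays from 1e-3 towards
# eta_min over T_0 epochs, then jumps back up to 1e-3 at each restart.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=1, eta_min=1e-7
)

for epoch in range(30):
    # ... one training epoch ...
    scheduler.step()
    print(epoch + 1, scheduler.get_last_lr())  # LR snaps back up every T_0 epochs
```

The timing of the NaNs would line up with that sudden jump back to a large learning rate.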
-
Dear TorchJD maintainers,
First, thank you for this incredible work; I was able to integrate it easily into my research on multi-task learning for pre-training foundation models. I'm not sure whether this is an issue you have encountered before, or even a well-known one, but I seem to have run into the Jacobian analogue of the vanishing gradient problem. After several training steps of pre-training my foundation model with TorchJD, the Jacobian values of the tensors connected to the shared features turn into NaNs during backpropagation:
Note: I modified the `bases.py` file to print out the shapes of the offending matrices when the error is thrown. Have you encountered this issue before? I'm not sure whether this is a bug on my end or a well-known problem in Jacobian descent.
I currently have multiple contrastive losses during pre-training that operate on embedding outputs at different layers of the model, i.e., intermediate losses between layers. These losses use the embeddings output before the first loss as the "shared_features", which I suspect could be something I am implementing wrong on my end (a simplified sketch of this wiring is below). Let me know if you have any other questions about my training setup that would help in understanding the issue, but I'm afraid I'm not allowed to share everything.
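To make the structure concrete, here is a heavily simplified stand-in for that wiring (tiny linear layers and dummy losses in place of the real encoder stages and contrastive objectives):

```python
import torch
import torchjd
from torchjd.aggregation import UPGrad

# Placeholder modules standing in for the real encoder stages and heads.
stage1 = torch.nn.Linear(32, 16)  # shared trunk producing the "shared_features"
head1 = torch.nn.Linear(16, 8)    # branch used by the first (intermediate) loss
stage2 = torch.nn.Linear(16, 16)  # later layers of the model
head2 = torch.nn.Linear(16, 8)    # branch used by a deeper loss

x = torch.randn(4, 32)
features = stage1(x)  # embeddings produced before the first loss

# Dummy stand-ins for the contrastive losses at different depths.
loss1 = head1(features).pow(2).mean()
loss2 = head2(stage2(features)).pow(2).mean()

aggregator = UPGrad()
torchjd.mtl_backward([loss1, loss2], features=features, aggregator=aggregator)
```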
Sincerely,
Matthew Chen