Vanishing Jacobian problem #343
Replies: 2 comments 1 reply
-
Hi! Thanks for your interest in TorchJD! It seems that the training of your first loss is quite unstable, making the model diverge (or at least the part of the model leading to your first loss), which results in NaN values in the gradient. If this comes from an exploding gradient issue, you could try to down-weight the first task in the aggregator:
```python
...
aggregator = UPGrad(pref_vector=torch.tensor([0.1, 1., 1., 1., 1., 1.]))
...
```

or scale down the first loss directly:

```python
...
torchjd.mtl_backward([loss1 / 10, loss2, loss3, loss4, loss5, loss6], features=features, aggregator=aggregator)
...
```
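Put together, a self-contained toy version of that suggestion could look like the following (the small linear model, random data, and dummy losses are placeholders just to make the snippet runnable, not your actual setup):

```python
import torch
import torchjd
from torchjd.aggregation import UPGrad

# Placeholder shared trunk and per-task heads, only to make the example runnable.
shared = torch.nn.Linear(8, 8)
heads = torch.nn.ModuleList([torch.nn.Linear(8, 1) for _ in range(6)])

x = torch.randn(16, 8)
features = shared(x)
losses = [head(features).pow(2).mean() for head in heads]

# Give the first task a 10x smaller weight in the aggregation so its
# (possibly exploding) gradient contributes less to the shared update.
aggregator = UPGrad(pref_vector=torch.tensor([0.1, 1., 1., 1., 1., 1.]))

torchjd.mtl_backward(losses, features=features, aggregator=aggregator)
```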
I'm not sure why the other rows of the Jacobian are zero. This could be due to a vanishing gradient problem. Is this always the case, even at the very first iteration of training? If so, there could be an issue in the way you use the ... A possible way to debug would also be to use the ... I hope this will help!
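Independently of that, one generic way to locate where NaN or all-zero rows first appear (plain PyTorch, not a TorchJD-specific tool) is to enable anomaly detection and inspect each row of the Jacobian by hand. A minimal sketch, assuming `loss1` ... `loss6` and the `features` tensor have just been computed and the graph is still alive (i.e., run this before the backward call):

```python
import torch

# Make autograd raise an error at the exact backward op that produces NaN/Inf.
torch.autograd.set_detect_anomaly(True)

# Each gradient below is one row of the Jacobian w.r.t. the shared features:
# an all-zero row suggests a vanishing gradient for that task, a NaN row
# suggests that task's branch has diverged.
for i, loss in enumerate([loss1, loss2, loss3, loss4, loss5, loss6]):
    (grad,) = torch.autograd.grad(loss, features, retain_graph=True)
    print(f"loss{i + 1}: norm={grad.norm().item():.3e}, "
          f"has_nan={torch.isnan(grad).any().item()}")
```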
-
Hi, thanks for the reply. I already have some layer norms within my model's "shared_features" part. I'll try experimenting with the pref_vector. I currently have the pref_vector slightly emphasizing the first task, as I want it to be optimized more heavily.
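Roughly along these lines (the numbers here are only illustrative, not my exact values):

```python
import torch
from torchjd.aggregation import UPGrad

# Illustrative preference vector giving the first task a larger weight;
# the exact values in my runs are different.
aggregator = UPGrad(pref_vector=torch.tensor([2., 1., 1., 1., 1., 1.]))
```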
The error usually appears deep into the training process (around the 9th or 10th training epoch). Something else to note is that I am using a CosineAnnealingWarmRestarts scheduler, and the error usually occurs when the learning rate "resets" from something small like 10^-7 back to 10^-3. I'll try experimenting some more.
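For reference, the scheduler is set up roughly like this (the model, optimizer, and T_0 here are placeholders, not my actual configuration):

```python
import torch

model = torch.nn.Linear(16, 4)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Cosine annealing with warm restarts: the LR decays from 1e-3 towards
# eta_min over T_0 epochs, then jumps back up to 1e-3 at each restart.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=1, eta_min=1e-7
)

for epoch in range(30):
    # ... one training epoch ...
    scheduler.step()
    print(epoch + 1, scheduler.get_last_lr())  # LR snaps back up every T_0 epochs
```

The timing of the NaNs would line up with that sudden jump back to a large learning rate.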
-
Dear TorchJD maintainers,
First, thank you for this incredible work; I was able to integrate it easily into my research on multi-task learning for pre-training foundation models. I'm not sure whether this is an issue you have encountered before, or even a well-known one, but I seem to have run into the Jacobian analogue of the vanishing gradient problem. After several training steps of pre-training my foundation model with TorchJD, the Jacobian values of the tensors connected to the shared features turn into NaNs during backpropagation:
Note: I modified the `bases.py` file to print out the shapes of the offending matrices when the error is thrown. Have you encountered this issue before? I'm not sure whether this is a bug on my end or a well-known problem in Jacobian descent.
I currently have multiple contrastive losses during pre-training that operate on embedding outputs at different layers of the model, i.e., intermediate losses between layers. These losses use the embeddings output before the first loss as the "shared_features", which I suspect could be something I am implementing wrong on my end (a simplified sketch of this wiring is below). Let me know if you have any other questions about my training setup that would help in understanding the issue, but I'm afraid I'm not allowed to share everything.
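To make the structure concrete, here is a heavily simplified stand-in for that wiring (tiny linear layers and dummy losses in place of the real encoder stages and contrastive objectives):

```python
import torch
import torchjd
from torchjd.aggregation import UPGrad

# Placeholder modules standing in for the real encoder stages and heads.
stage1 = torch.nn.Linear(32, 16)  # shared trunk producing the "shared_features"
head1 = torch.nn.Linear(16, 8)    # branch used by the first (intermediate) loss
stage2 = torch.nn.Linear(16, 16)  # later layers of the model
head2 = torch.nn.Linear(16, 8)    # branch used by a deeper loss

x = torch.randn(4, 32)
features = stage1(x)  # embeddings produced before the first loss

# Dummy stand-ins for the contrastive losses at different depths.
loss1 = head1(features).pow(2).mean()
loss2 = head2(stage2(features)).pow(2).mean()

aggregator = UPGrad()
torchjd.mtl_backward([loss1, loss2], features=features, aggregator=aggregator)
```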
Sincerely,
Matthew Chen