Mitigating program hang from on_train_epoch_end() with self.all_gather() call #20294
              
Unanswered

isaacgerg asked this question in DDP / multi-GPU / multi-node
I am trying to manually save my loss, which is a single scalar, each epoch and then print it out in on_train_epoch_end().
During training_step() I do self.train_loss.append(loss.item()). Then, in on_train_epoch_end(), I immediately call self.all_gather(self.train_loss), but it hangs until NCCL times out. I am on a single node with 2 GPUs.
What really stumps me is that this paradigm works fine for on_test_epoch_end() and test_step().
Any thoughts on how to debug or fix this? What is the "right" way to write this code so it works correctly?
Reference: PyTorch 2.4 and Lightning 2.4.0.
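For concreteness, here is a minimal sketch of the pattern described above (the LitModel class, the toy Linear layer, and the optimizer are placeholder assumptions, not the original code). The comments mark the two things that most often matter for this kind of hang: every rank must reach the all_gather call, and each rank should gather a tensor of the same shape.

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    """Toy module illustrating the pattern in the question (names are placeholders)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.train_loss = []  # per-step scalar losses collected on this rank

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.train_loss.append(loss.item())  # plain Python float, no graph held
        return loss

    def on_train_epoch_end(self):
        # Reduce to a single fixed-shape tensor *before* gathering: per-rank
        # lists can have different lengths, and mismatched shapes can stall
        # the collective. Every rank must also reach this line -- guarding it
        # with `if self.trainer.is_global_zero:` would leave the other rank
        # waiting until NCCL times out.
        local_mean = torch.tensor(self.train_loss, device=self.device).mean()
        gathered = self.all_gather(local_mean)  # shape: (world_size,)
        if self.trainer.is_global_zero:
            print(f"epoch {self.current_epoch}: mean train loss = {gathered.mean().item():.4f}")
        self.train_loss.clear()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```

An alternative worth considering is to skip the manual bookkeeping and call self.log("train_loss", loss, on_epoch=True, sync_dist=True) inside training_step(), letting Lightning handle the cross-rank reduction.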