Mitigating program hang from on_train_epoch_end() with self.all_gather() call #20294
              
Unanswered

isaacgerg asked this question in DDP / multi-GPU / multi-node
I am trying to manually save my loss, which is a single scalar, each epoch and then print it out in on_train_epoch_end().
During training_step() I do self.train_loss.append(loss.item()). Then, in on_train_epoch_end(), I immediately call self.all_gather(self.train_loss), but it hangs until NCCL times out. I am on a single node with 2 GPUs.
What really stumps me is that this paradigm works fine for on_test_epoch_end() and test_step().
Any thoughts on how to debug or fix this? What is the "right" way to write this code so it works correctly?
Reference: PyTorch 2.4 and Lightning 2.4.0.
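For concreteness, here is a minimal sketch of the pattern described above (the LitModel class, the toy Linear layer, and the optimizer are placeholder assumptions, not the original code). The comments mark the two things that most often matter for this kind of hang: every rank must reach the all_gather call, and each rank should gather a tensor of the same shape.

```python
import torch
import lightning as L


class LitModel(L.LightningModule):
    """Toy module illustrating the pattern in the question (names are placeholders)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.train_loss = []  # per-step scalar losses collected on this rank

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.train_loss.append(loss.item())  # plain Python float, no graph held
        return loss

    def on_train_epoch_end(self):
        # Reduce to a single fixed-shape tensor *before* gathering: per-rank
        # lists can have different lengths, and mismatched shapes can stall
        # the collective. Every rank must also reach this line -- guarding it
        # with `if self.trainer.is_global_zero:` would leave the other rank
        # waiting until NCCL times out.
        local_mean = torch.tensor(self.train_loss, device=self.device).mean()
        gathered = self.all_gather(local_mean)  # shape: (world_size,)
        if self.trainer.is_global_zero:
            print(f"epoch {self.current_epoch}: mean train loss = {gathered.mean().item():.4f}")
        self.train_loss.clear()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```

An alternative worth considering is to skip the manual bookkeeping and call self.log("train_loss", loss, on_epoch=True, sync_dist=True) inside training_step(), letting Lightning handle the cross-rank reduction.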