Description
I am seeing a strange issue: on Ubuntu 20.04 the loss blows up to NaN after I install CUDA, whereas without CUDA the loss decreases normally.
Environment:
megengine == 1.3.0
torch == 1.9.1
Even stranger, the loss occasionally falls back to a finite value:
Epoch 0 Step 0, Speed=1.7 mb/s, dp_cost=0.0098, Loss=4.02e-02, lr=2.00e-04
Epoch 0 Step 10, Speed=14 mb/s, dp_cost=0.25, Loss=1.56e+00, lr=2.00e-04
Epoch 0 Step 20, Speed=19 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04
Epoch 0 Step 30, Speed=17 mb/s, dp_cost=0.054, Loss=3.62e-02, lr=2.00e-04
Epoch 0 Step 40, Speed=20 mb/s, dp_cost=0.037, Loss=2.22e-01, lr=2.00e-04
Epoch 0 Step 50, Speed=15 mb/s, dp_cost=0.16, Loss= nan, lr=2.00e-04
Epoch 0 Step 60, Speed=16 mb/s, dp_cost=0.21, Loss= nan, lr=2.00e-04
Epoch 0 Step 70, Speed=18 mb/s, dp_cost=0.033, Loss= nan, lr=2.00e-04
Epoch 0 Step 80, Speed=18 mb/s, dp_cost=0.11, Loss= nan, lr=2.00e-04
Epoch 0 Step 90, Speed=15 mb/s, dp_cost=0.2, Loss= nan, lr=2.00e-04
Epoch 0 Step 100, Speed=12 mb/s, dp_cost=0.037, Loss= nan, lr=2.00e-04
Epoch 0 Step 110, Speed=15 mb/s, dp_cost=0.22, Loss= nan, lr=2.00e-04
Epoch 0 Step 120, Speed=17 mb/s, dp_cost=0.035, Loss= nan, lr=2.00e-04
Epoch 0 Step 130, Speed=17 mb/s, dp_cost=0.028, Loss= nan, lr=2.00e-04
Epoch 0 Step 140, Speed=13 mb/s, dp_cost=0.1, Loss= nan, lr=2.00e-04
Epoch 0 Step 150, Speed=17 mb/s, dp_cost=0.24, Loss= nan, lr=2.00e-04
Epoch 0 Step 160, Speed=15 mb/s, dp_cost=0.23, Loss= nan, lr=2.00e-04
Epoch 0 Step 170, Speed=8.5 mb/s, dp_cost=0.18, Loss= nan, lr=2.00e-04
Epoch 0 Step 180, Speed=16 mb/s, dp_cost=0.078, Loss= nan, lr=2.00e-04
Epoch 0 Step 190, Speed=15 mb/s, dp_cost=0.21, Loss= nan, lr=2.00e-04
Epoch 0 Step 200, Speed=15 mb/s, dp_cost=0.06, Loss= nan, lr=2.00e-04
Epoch 0 Step 210, Speed=19 mb/s, dp_cost=0.034, Loss=6.01e-02, lr=2.00e-04
Epoch 0 Step 220, Speed=16 mb/s, dp_cost=0.058, Loss= nan, lr=2.00e-04
Epoch 0 Step 230, Speed=19 mb/s, dp_cost=0.037, Loss= nan, lr=2.00e-04
Epoch 0 Step 240, Speed=17 mb/s, dp_cost=0.11, Loss= nan, lr=2.00e-04
Epoch 0 Step 250, Speed=10 mb/s, dp_cost=0.016, Loss= nan, lr=2.00e-04
Epoch 0 Step 260, Speed=16 mb/s, dp_cost=0.028, Loss= nan, lr=2.00e-04
Epoch 0 Step 270, Speed=16 mb/s, dp_cost=0.031, Loss= nan, lr=2.00e-04
Epoch 0 Step 280, Speed=10 mb/s, dp_cost=0.14, Loss= nan, lr=2.00e-04
Epoch 0 Step 290, Speed=17 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04
Epoch 0 Step 300, Speed=15 mb/s, dp_cost=0.029, Loss= nan, lr=2.00e-04
Epoch 0 Step 310, Speed=19 mb/s, dp_cost=0.032, Loss=2.82e-01, lr=2.00e-04
Epoch 0 Step 320, Speed=17 mb/s, dp_cost=0.036, Loss=6.47e-02, lr=2.00e-04
Epoch 0 Step 330, Speed=15 mb/s, dp_cost=0.031, Loss= nan, lr=2.00e-04
Epoch 0 Step 340, Speed=13 mb/s, dp_cost=0.082, Loss= nan, lr=2.00e-04
Epoch 0 Step 350, Speed=13 mb/s, dp_cost=0.021, Loss=3.08e-01, lr=2.00e-04
Epoch 0 Step 360, Speed=14 mb/s, dp_cost=0.022, Loss= nan, lr=2.00e-04
Epoch 0 Step 370, Speed=16 mb/s, dp_cost=0.051, Loss= nan, lr=2.00e-04
Epoch 0 Step 380, Speed=11 mb/s, dp_cost=0.19, Loss= nan, lr=2.00e-04
Epoch 0 Step 390, Speed=15 mb/s, dp_cost=0.069, Loss= nan, lr=2.00e-04
Epoch 0 Step 400, Speed=16 mb/s, dp_cost=0.037, Loss= nan, lr=2.00e-04
Epoch 0 Step 410, Speed=17 mb/s, dp_cost=0.032, Loss= nan, lr=2.00e-04
Epoch 0 Step 420, Speed=19 mb/s, dp_cost=0.038, Loss= nan, lr=2.00e-04
Epoch 0 Step 430, Speed=17 mb/s, dp_cost=0.035, Loss= nan, lr=2.00e-04
Epoch 0 Step 440, Speed=11 mb/s, dp_cost=0.021, Loss=9.56e-02, lr=2.00e-04
Epoch 0 Step 450, Speed=16 mb/s, dp_cost=0.057, Loss= nan, lr=2.00e-04
Epoch 0 Step 460, Speed=15 mb/s, dp_cost=0.22, Loss= nan, lr=2.00e-04
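
To help narrow this down, here is a minimal sketch for pulling the loss values out of a log like the one above and finding the first step where the loss is not finite. It assumes the log line format shown in this issue; the names `parse_losses` and `first_non_finite` are hypothetical helpers written for illustration and are not part of megengine or torch.

```python
import math
import re

# Matches lines such as:
# "Epoch 0 Step 20, Speed=19 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04"
# (format assumed from the log in this issue)
LINE_RE = re.compile(r"Epoch\s+(\d+)\s+Step\s+(\d+),.*?Loss=\s*([^,\s]+)")


def parse_losses(log_text):
    """Return a list of (epoch, step, loss) tuples parsed from the log text."""
    records = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            epoch, step, loss = match.groups()
            # float() accepts "nan", so NaN entries are kept as float('nan').
            records.append((int(epoch), int(step), float(loss)))
    return records


def first_non_finite(records):
    """Return the (epoch, step) of the first NaN/inf loss, or None if all are finite."""
    for epoch, step, loss in records:
        if not math.isfinite(loss):
            return epoch, step
    return None


# First four lines of the log from this issue.
sample = """\
Epoch 0 Step 0, Speed=1.7 mb/s, dp_cost=0.0098, Loss=4.02e-02, lr=2.00e-04
Epoch 0 Step 10, Speed=14 mb/s, dp_cost=0.25, Loss=1.56e+00, lr=2.00e-04
Epoch 0 Step 20, Speed=19 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04
Epoch 0 Step 30, Speed=17 mb/s, dp_cost=0.054, Loss=3.62e-02, lr=2.00e-04
"""

records = parse_losses(sample)
print(first_non_finite(records))  # prints (0, 20)
```

In the log above, the loss already jumps from 4.02e-02 to 1.56e+00 at step 10, before the first NaN is logged at step 20, so the steps just before the first non-finite value are probably the ones worth inspecting.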