Why does the loss blow up on Linux? #9

@DUCH714

I am seeing a strange issue: on Ubuntu 20.04 the loss blows up to NaN after I install CUDA, whereas without CUDA the loss decreases normally.

Environment:
megengine == 1.3.0
torch == 1.9.1

Even stranger, the loss sometimes falls back to a finite value between NaN steps:

Epoch 0 Step 0, Speed=1.7 mb/s, dp_cost=0.0098, Loss=4.02e-02, lr=2.00e-04
Epoch 0 Step 10, Speed=14 mb/s, dp_cost=0.25, Loss=1.56e+00, lr=2.00e-04
Epoch 0 Step 20, Speed=19 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04
Epoch 0 Step 30, Speed=17 mb/s, dp_cost=0.054, Loss=3.62e-02, lr=2.00e-04
Epoch 0 Step 40, Speed=20 mb/s, dp_cost=0.037, Loss=2.22e-01, lr=2.00e-04
Epoch 0 Step 50, Speed=15 mb/s, dp_cost=0.16, Loss= nan, lr=2.00e-04
Epoch 0 Step 60, Speed=16 mb/s, dp_cost=0.21, Loss= nan, lr=2.00e-04
Epoch 0 Step 70, Speed=18 mb/s, dp_cost=0.033, Loss= nan, lr=2.00e-04
Epoch 0 Step 80, Speed=18 mb/s, dp_cost=0.11, Loss= nan, lr=2.00e-04
Epoch 0 Step 90, Speed=15 mb/s, dp_cost=0.2, Loss= nan, lr=2.00e-04
Epoch 0 Step 100, Speed=12 mb/s, dp_cost=0.037, Loss= nan, lr=2.00e-04
Epoch 0 Step 110, Speed=15 mb/s, dp_cost=0.22, Loss= nan, lr=2.00e-04
Epoch 0 Step 120, Speed=17 mb/s, dp_cost=0.035, Loss= nan, lr=2.00e-04
Epoch 0 Step 130, Speed=17 mb/s, dp_cost=0.028, Loss= nan, lr=2.00e-04
Epoch 0 Step 140, Speed=13 mb/s, dp_cost=0.1, Loss= nan, lr=2.00e-04
Epoch 0 Step 150, Speed=17 mb/s, dp_cost=0.24, Loss= nan, lr=2.00e-04
Epoch 0 Step 160, Speed=15 mb/s, dp_cost=0.23, Loss= nan, lr=2.00e-04
Epoch 0 Step 170, Speed=8.5 mb/s, dp_cost=0.18, Loss= nan, lr=2.00e-04
Epoch 0 Step 180, Speed=16 mb/s, dp_cost=0.078, Loss= nan, lr=2.00e-04
Epoch 0 Step 190, Speed=15 mb/s, dp_cost=0.21, Loss= nan, lr=2.00e-04
Epoch 0 Step 200, Speed=15 mb/s, dp_cost=0.06, Loss= nan, lr=2.00e-04
Epoch 0 Step 210, Speed=19 mb/s, dp_cost=0.034, Loss=6.01e-02, lr=2.00e-04
Epoch 0 Step 220, Speed=16 mb/s, dp_cost=0.058, Loss= nan, lr=2.00e-04
Epoch 0 Step 230, Speed=19 mb/s, dp_cost=0.037, Loss= nan, lr=2.00e-04
Epoch 0 Step 240, Speed=17 mb/s, dp_cost=0.11, Loss= nan, lr=2.00e-04
Epoch 0 Step 250, Speed=10 mb/s, dp_cost=0.016, Loss= nan, lr=2.00e-04
Epoch 0 Step 260, Speed=16 mb/s, dp_cost=0.028, Loss= nan, lr=2.00e-04
Epoch 0 Step 270, Speed=16 mb/s, dp_cost=0.031, Loss= nan, lr=2.00e-04
Epoch 0 Step 280, Speed=10 mb/s, dp_cost=0.14, Loss= nan, lr=2.00e-04
Epoch 0 Step 290, Speed=17 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04
Epoch 0 Step 300, Speed=15 mb/s, dp_cost=0.029, Loss= nan, lr=2.00e-04
Epoch 0 Step 310, Speed=19 mb/s, dp_cost=0.032, Loss=2.82e-01, lr=2.00e-04
Epoch 0 Step 320, Speed=17 mb/s, dp_cost=0.036, Loss=6.47e-02, lr=2.00e-04
Epoch 0 Step 330, Speed=15 mb/s, dp_cost=0.031, Loss= nan, lr=2.00e-04
Epoch 0 Step 340, Speed=13 mb/s, dp_cost=0.082, Loss= nan, lr=2.00e-04
Epoch 0 Step 350, Speed=13 mb/s, dp_cost=0.021, Loss=3.08e-01, lr=2.00e-04
Epoch 0 Step 360, Speed=14 mb/s, dp_cost=0.022, Loss= nan, lr=2.00e-04
Epoch 0 Step 370, Speed=16 mb/s, dp_cost=0.051, Loss= nan, lr=2.00e-04
Epoch 0 Step 380, Speed=11 mb/s, dp_cost=0.19, Loss= nan, lr=2.00e-04
Epoch 0 Step 390, Speed=15 mb/s, dp_cost=0.069, Loss= nan, lr=2.00e-04
Epoch 0 Step 400, Speed=16 mb/s, dp_cost=0.037, Loss= nan, lr=2.00e-04
Epoch 0 Step 410, Speed=17 mb/s, dp_cost=0.032, Loss= nan, lr=2.00e-04
Epoch 0 Step 420, Speed=19 mb/s, dp_cost=0.038, Loss= nan, lr=2.00e-04
Epoch 0 Step 430, Speed=17 mb/s, dp_cost=0.035, Loss= nan, lr=2.00e-04
Epoch 0 Step 440, Speed=11 mb/s, dp_cost=0.021, Loss=9.56e-02, lr=2.00e-04
Epoch 0 Step 450, Speed=16 mb/s, dp_cost=0.057, Loss= nan, lr=2.00e-04
Epoch 0 Step 460, Speed=15 mb/s, dp_cost=0.22, Loss= nan, lr=2.00e-04
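
To make the pattern in a log like this easier to inspect, here is a minimal sketch that lists the steps whose logged loss is NaN or infinite. It assumes the log lines keep exactly the format shown above; the names `LOG_LINE` and `non_finite_steps` are hypothetical helpers and not part of the training code, and the script only summarizes the log rather than diagnosing the cause.

```python
import math
import re

# Matches the log lines above, e.g.
# "Epoch 0 Step 20, Speed=19 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04"
# (assumption: the format does not change between runs).
LOG_LINE = re.compile(r"Epoch\s+(\d+)\s+Step\s+(\d+),.*?Loss=\s*([^,\s]+)")


def non_finite_steps(log_text):
    """Return (epoch, step) pairs whose logged loss is NaN or infinite."""
    bad = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not training log entries
        epoch = int(match.group(1))
        step = int(match.group(2))
        loss = float(match.group(3))  # float() accepts "nan" and "inf"
        if not math.isfinite(loss):
            bad.append((epoch, step))
    return bad


# First four lines of the log above, used as sample input.
sample = """\
Epoch 0 Step 0, Speed=1.7 mb/s, dp_cost=0.0098, Loss=4.02e-02, lr=2.00e-04
Epoch 0 Step 10, Speed=14 mb/s, dp_cost=0.25, Loss=1.56e+00, lr=2.00e-04
Epoch 0 Step 20, Speed=19 mb/s, dp_cost=0.034, Loss= nan, lr=2.00e-04
Epoch 0 Step 30, Speed=17 mb/s, dp_cost=0.054, Loss=3.62e-02, lr=2.00e-04
"""

print(non_finite_steps(sample))  # [(0, 20)]
```

Running the same function over the CPU-only log and the CUDA log would show whether the first non-finite step is reproducible across runs.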
