What is the effective loss scaling? Is the loss summed or averaged over classes? Over the batch? How does it interact with distributed training: is there any scaling by the world size anywhere?
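To make the question concrete, here is a minimal sketch of what I mean by "effective scaling". It assumes a per-element loss of shape (batch, classes), a choice of sum or mean over each axis, and a data-parallel setup that may or may not average gradients across ranks. The function names, parameters, and numbers are hypothetical and not taken from any particular codebase.

```python
import math


def reduce_loss(per_elem, class_reduction, batch_reduction):
    """Reduce a list of samples (each a list of per-class losses) to a scalar.

    class_reduction / batch_reduction are each "sum" or "mean" (hypothetical knobs).
    """
    per_sample = [
        sum(row) / len(row) if class_reduction == "mean" else sum(row)
        for row in per_elem
    ]
    total = sum(per_sample)
    return total / len(per_sample) if batch_reduction == "mean" else total


def effective_scale(num_classes, per_rank_batch, world_size,
                    class_reduction, batch_reduction,
                    grads_averaged_across_ranks):
    """Factor applied to the plain sum of all per-element losses over the
    global batch, assuming every rank sees the same per-rank batch size."""
    scale = 1.0
    if class_reduction == "mean":
        scale /= num_classes
    if batch_reduction == "mean":
        scale /= per_rank_batch
    if grads_averaged_across_ranks:
        # Averaging gradients over ranks divides by the world size once more.
        scale /= world_size
    return scale


# Toy check of the local reduction on a 2-sample, 2-class loss.
toy = [[1.0, 2.0], [3.0, 4.0]]
loss_mean_mean = reduce_loss(toy, "mean", "mean")
loss_sum_sum = reduce_loss(toy, "sum", "sum")

# Mean over classes and batch, gradients averaged over 8 ranks.
scale_all_mean = effective_scale(10, 32, 8, "mean", "mean", True)
# Sum over classes, mean over batch, gradients summed (not averaged) over ranks.
scale_mixed = effective_scale(10, 32, 8, "sum", "mean", False)
```

Under these assumptions, mean-over-everything with gradient averaging gives a scale of 1 / (classes × per-rank batch × world size), which is the global mean. Any axis that is summed instead drops its factor, so the effective learning rate changes with that axis's size. What I would like to know is which of these combinations applies here.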