Sequence Parallelism, memory usage question #3803
EthanChen1234 started this conversation in Community | General
In the paper "Sequence Parallelism: Long Sequence Training from System Perspective", the memory consumed by tensor parallelism is split between activations and trainable weights. The activations account for 4BLH/N + BLH. The trainable weights of the MLP block should come to 4H^2/N + 4H^2/N = 8H^2/N (the two linear layers, sharded over N devices). I'm confused about where the paper's figure of 32H^2/N for weight memory comes from.
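One possible accounting, which is an assumption on my part and not confirmed by the paper: if each parameter is stored four times during training (the weight, its gradient, and the two Adam optimizer states m and v), then 4 × 8H^2/N = 32H^2/N. A minimal sketch of that arithmetic:

```python
# Hypothetical accounting for MLP weight memory under Megatron-style
# tensor parallelism. The factor `copies=4` (weight + gradient + Adam
# m + Adam v) is an assumption, not something stated in the paper.

def mlp_weight_params(H, N):
    """Per-device parameter count of the two MLP linear layers.

    The first linear is H x 4H (column-parallel), the second is
    4H x H (row-parallel); each holds 4*H*H / N parameters per device,
    giving 8H^2/N in total.
    """
    return 4 * H * H // N + 4 * H * H // N


def mlp_weight_memory(H, N, copies=4):
    """Per-device tensor count if every parameter is materialized
    `copies` times (weight, gradient, and two Adam states by default)."""
    return copies * mlp_weight_params(H, N)


if __name__ == "__main__":
    H, N = 1024, 8
    print(mlp_weight_params(H, N))   # 8H^2/N
    print(mlp_weight_memory(H, N))   # 32H^2/N under the 4-copy assumption
```

Under that assumption the 32H^2/N figure is simply 4 copies of the 8H^2/N parameters, but I'd appreciate confirmation of what the paper is actually counting.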