In your source code, the first time `forward` is called you use tokens from 0:1 in each batch. This is not true for LLaMA 2 (decoder-only).

LLaMA 2 generation can be divided into a prefill step and decode steps. In the prefill step, all prompt tokens should be passed to `forward`, not just the first token. The prefill step then generates the first new token, which is appended to the token list, and the next token is generated from that.

So the computation is: first compute over [sequence_length, dim], then (using the KV cache) compute only over [1, dim] for each subsequent step.

You can see the reference code in the llama repo: https://github.com/meta-llama/llama/blob/main/llama/generation.py
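As a rough illustration of what I mean (a minimal sketch, not the exact code from either repo; the `model(tokens, start_pos)` interface and greedy argmax sampling here are assumptions for brevity):

```python
import torch

@torch.inference_mode()
def generate(model, prompt_tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Prefill + incremental decode with a KV cache.

    Assumes `model(tokens, start_pos)` returns logits of shape
    [batch, seq_len, vocab] and caches keys/values internally for
    positions start_pos..start_pos+seq_len-1 (hypothetical interface,
    similar in spirit to meta-llama/llama's generation.py).
    """
    tokens = prompt_tokens  # shape [batch, prompt_len]

    # Prefill: the whole prompt goes through forward once, i.e. attention
    # runs over [prompt_len, dim] and fills the KV cache for positions
    # 0..prompt_len-1.
    logits = model(tokens, start_pos=0)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)

    # Decode: each step feeds only the single newest token ([1, dim]),
    # reusing the cached keys/values for all earlier positions.
    for _ in range(max_new_tokens - 1):
        start_pos = tokens.shape[1] - 1
        logits = model(tokens[:, -1:], start_pos=start_pos)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)

    return tokens
```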
Also, I think the decoder layer is described as an encoder layer in your source code.