In your source code, the first time `forward` is called you use tokens from 0:1 in each batch. This is not true for LLaMA 2 (decoder-only).

LLaMA 2 generation can be divided into a prefill step and decode steps. In the prefill step, all prompt tokens should be passed to `forward`, not just the first token. The prefill step then generates the first new token, which is appended to the token list, and the next token is generated from that.

So the computation is: first compute over [sequence_length, dim], then (using the KV cache) compute only over [1, dim] for each subsequent step.

You can see the reference code in the llama repo: https://github.com/meta-llama/llama/blob/main/llama/generation.py
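As a rough illustration of what I mean (a minimal sketch, not the exact code from either repo; the `model(tokens, start_pos)` interface and greedy argmax sampling here are assumptions for brevity):

```python
import torch

@torch.inference_mode()
def generate(model, prompt_tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Prefill + incremental decode with a KV cache.

    Assumes `model(tokens, start_pos)` returns logits of shape
    [batch, seq_len, vocab] and caches keys/values internally for
    positions start_pos..start_pos+seq_len-1 (hypothetical interface,
    similar in spirit to meta-llama/llama's generation.py).
    """
    tokens = prompt_tokens  # shape [batch, prompt_len]

    # Prefill: the whole prompt goes through forward once, i.e. attention
    # runs over [prompt_len, dim] and fills the KV cache for positions
    # 0..prompt_len-1.
    logits = model(tokens, start_pos=0)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)

    # Decode: each step feeds only the single newest token ([1, dim]),
    # reusing the cached keys/values for all earlier positions.
    for _ in range(max_new_tokens - 1):
        start_pos = tokens.shape[1] - 1
        logits = model(tokens[:, -1:], start_pos=start_pos)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)

    return tokens
```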
Also, I think the decoder layer is described as an encoder layer in your source code.