
Conversation

zucchini-nlp (Member)
What does this PR do?

As per the title. The Legacy cache format section will be deleted in 2-3 releases, once all legacy support is removed.

This PR:

  • Adds docs on multimodal attention implementation setting
  • Cross-references the attention implementation docs from other pages. I didn't know that docs page existed, and searching for FA2/SDPA usually doesn't surface it; let's make it more discoverable
  • Adds a section explaining what cache position is; as many users have reported, the concept is still confusing. We can add more examples later. I remember @gante had a PR on fixing generate when cache position is provided by users :)
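As a rough illustration of the concept the new docs section covers (the helper below is hypothetical, not the transformers implementation): cache positions are simply the indices that incoming tokens will occupy in the KV cache, offset by how many tokens the cache has already seen.

```python
# Hypothetical sketch of cache positions, not the transformers code:
# the positions new tokens occupy in the KV cache are just a range
# starting at the number of tokens the cache has already seen.
def cache_positions(past_seen_tokens: int, new_tokens: int) -> list[int]:
    """Indices that the new tokens will occupy in the KV cache."""
    return list(range(past_seen_tokens, past_seen_tokens + new_tokens))

# Prefill: a 5-token prompt with an empty cache fills slots 0..4.
prefill = cache_positions(past_seen_tokens=0, new_tokens=5)

# Decode: the next single token lands at slot 5.
decode = cache_positions(past_seen_tokens=5, new_tokens=1)
```

This is only meant to show why cache position differs from plain sequence position once a cache holds previous tokens.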

@zucchini-nlp zucchini-nlp requested review from gante and stevhliu July 21, 2025 09:09
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu left a comment


Thanks for the updates and improved cross-linking!

zucchini-nlp and others added 2 commits July 22, 2025 09:36
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@zucchini-nlp zucchini-nlp merged commit cd98c1f into huggingface:main Jul 22, 2025
15 checks passed
ArthurZucker added a commit that referenced this pull request Jul 22, 2025
* use partial to wrap around `transformers` utils!

* try to refactor?

* revert one wrong change

* just a nit

* push

* revert whatever was wrong!

* some nits

* fixes when there is no attention mask

* bring the licence back

* some fixes

* nit

* style

* remove prints

* correct dtype

* fa flags for testing

* update

* use paged attention if requested!

* updates

* a clone was needed, not sure why

* automatically create cu seq lens when input is flash, this at least makes sure layers don't re-compute

* simplify and improve?

* flash attention is somewhat broken on recent CUDA versions, so allow the opportunity to use something else

* fix!

* protect kernels import

* update

* properly parse generation config being passed

* revert and update

* add two tests

* some fixes

* fix test FA2

* take comments into account

* fixup

* revert changes

* revert the clone, it is only needed because the metal kernel is not doing it?

* [docs] update attention implementation and cache docs (#39547)

* update docs

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* apply suggestions

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix mps on our side for now

* Update src/transformers/integrations/flash_paged.py

* no qa

---------

Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
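One of the commits above pre-computes cumulative sequence lengths (`cu_seqlens`) when the input is flash-attention shaped, so individual layers don't re-derive them. A minimal sketch of that bookkeeping (a hypothetical helper, not the actual transformers code): varlen flash-attention kernels index a packed batch by cumulative sequence lengths with a leading zero.

```python
# Hedged sketch: varlen flash-attention kernels take "cu_seqlens",
# the cumulative lengths of the sequences packed into one batch,
# prefixed with 0. This helper is illustrative only.
def cu_seqlens(seq_lens: list[int]) -> list[int]:
    out = [0]
    for n in seq_lens:
        out.append(out[-1] + n)
    return out

# Three packed sequences of lengths 3, 5, and 2:
print(cu_seqlens([3, 5, 2]))  # [0, 3, 8, 10]
```

Computing this once per forward pass, rather than per layer, is the redundancy the commit message refers to.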