
Conversation

zucchini-nlp (Member)
What does this PR do?

As per the title. The Legacy cache format section will be deleted in 2-3 releases, once all legacy support is removed.

This PR:

  • Adds docs on multimodal attention implementation setting
  • Cross-references the attention implementation docs from other pages. I didn't know that docs page existed, and searching for FA2/SDPA usually doesn't surface it; let's make it more discoverable
  • Adds a section explaining what cache position is; as many users have reported, the concept is still confusing. We can add more examples later. I remember @gante had a PR on fixing generate when cache position is provided by users :)
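As a rough illustration of the concept the new docs section covers (the helper below is hypothetical, not the transformers implementation): cache positions are simply the indices that incoming tokens will occupy in the KV cache, offset by how many tokens the cache has already seen.

```python
# Hypothetical sketch of cache positions, not the transformers code:
# the positions new tokens occupy in the KV cache are just a range
# starting at the number of tokens the cache has already seen.
def cache_positions(past_seen_tokens: int, new_tokens: int) -> list[int]:
    """Indices that the new tokens will occupy in the KV cache."""
    return list(range(past_seen_tokens, past_seen_tokens + new_tokens))

# Prefill: a 5-token prompt with an empty cache fills slots 0..4.
prefill = cache_positions(past_seen_tokens=0, new_tokens=5)

# Decode: the next single token lands at slot 5.
decode = cache_positions(past_seen_tokens=5, new_tokens=1)
```

This is only meant to show why cache position differs from plain sequence position once a cache holds previous tokens.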

@zucchini-nlp zucchini-nlp requested review from gante and stevhliu July 21, 2025 09:09
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu left a comment


Thanks for the updates and improved cross-linking!

zucchini-nlp and others added 2 commits July 22, 2025 09:36
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
@zucchini-nlp zucchini-nlp merged commit cd98c1f into huggingface:main Jul 22, 2025
15 checks passed
ArthurZucker added a commit that referenced this pull request Jul 22, 2025
* use partial to wrap around `transformers` utils!

* try to refactor?

* revert one wrong change

* just a nit

* push

* revert whatever was wrong!

* some nits

* fixes when there is no attention mask

* bring the licence back

* some fixes

* nit

* style

* remove prints

* correct dtype

* fa flags for testing

* update

* use paged attention if requested!

* updates

* a clone was needed, not sure why

* automatically create cu seq lens when input is flash, this at least makes sure layers don't re-compute

* simplify and improve?

* flash attention is somewhat broken on recent CUDA versions, so allow the opportunity to use something else

* fix!

* protect kernels import

* update

* properly parse generation config being passed

* revert and update

* add two tests

* some fixes

* fix test FA2

* take comments into account

* fixup

* revert changes

* revert the clone, it is only needed because the metal kernel is not doing it?

* [docs] update attention implementation and cache docs (#39547)

* update docs

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* apply suggestions

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix mps on our side for now

* Update src/transformers/integrations/flash_paged.py

* no qa

---------

Co-authored-by: Vasqu <antonprogamer@gmail.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
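One of the commits above pre-computes cumulative sequence lengths (`cu_seqlens`) when the input is flash-attention shaped, so individual layers don't re-derive them. A minimal sketch of that bookkeeping (a hypothetical helper, not the actual transformers code): varlen flash-attention kernels index a packed batch by cumulative sequence lengths with a leading zero.

```python
# Hedged sketch: varlen flash-attention kernels take "cu_seqlens",
# the cumulative lengths of the sequences packed into one batch,
# prefixed with 0. This helper is illustrative only.
def cu_seqlens(seq_lens: list[int]) -> list[int]:
    out = [0]
    for n in seq_lens:
        out.append(out[-1] + n)
    return out

# Three packed sequences of lengths 3, 5, and 2:
print(cu_seqlens([3, 5, 2]))  # [0, 3, 8, 10]
```

Computing this once per forward pass, rather than per layer, is the redundancy the commit message refers to.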