
Conversation

@nrfulton (Contributor) commented Aug 29, 2025

DRAFT

Adds HuggingFace._generate_from_context_with_kv_cache, re-introducing some of the stuff from our early hacking in May/June. (Consider this a sign we're finally getting back to the fun stuff again.)

Stuff that still needs to be done:

  • lmcache integration (currently using a dict, which has some... pitfalls... to put it mildly)
  • Component.parts() needs an implementation and walks, similar to the old span default_formatter_walk.
  • Heuristics for excluding improperly marked blocks/parts
  • Think about how this integrates with alora code
  • Perhaps refactor at least the HF backend code
  • Add use_kv_caching flag to generate_from_context.
  • Add benchmarking (quality impacts and time savings)
  • Tests, examples, tutorials

Also still to be worked out: how this fits into a model that uses apply_chat_template or any
other parser/renderer. Note that there's still a bug arising from the
chance that other substrings also "hit" on the cached
contents. We don't anticipate this happening often in practice because of
how KV cache smashing should typically be used, but it's something we
need to address by introducing the use of sentinel values, or indexing
string machines, or something else along those lines. A minimal sketch of
the pitfall follows.
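
A minimal sketch of the pitfall, assuming a plain string-keyed dict cache; every name here (kv_cache, find_hit, mark) is illustrative rather than the actual implementation:

```python
# Illustrative only: a plain string-keyed dict cache and the naive
# substring lookup that triggers the bug described above.
kv_cache: dict[str, object] = {"The capital of France": "<kv states>"}

def find_hit(prompt: str) -> str | None:
    # Any cached string that merely *appears* in the prompt "hits",
    # even at a position other than the one its KV states were
    # computed for.
    for block in kv_cache:
        if block in prompt:
            return block
    return None

# Accidental hit: the cached text occurs mid-prompt, where reusing its
# KV states would be wrong.
assert find_hit("Q: The capital of France is?") is not None

# One mitigation floated above: wrap deliberately cached blocks in
# sentinel characters so only marked spans can ever match.
OPEN, CLOSE = "\u241e", "\u241f"  # record/unit separators, rare in text

def mark(text: str) -> str:
    return f"{OPEN}{text}{CLOSE}"
```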

Committed with --no-verify because the point of this code is documentation.
@nrfulton nrfulton marked this pull request as draft August 29, 2025 18:25
@guicho271828 (Contributor) commented:

I was wondering how it was implemented. The way I envisioned it is a token-wise trie represented as a flat dictionary that stores the legacy KV cache format token-by-token, splitting the third dimension of [batch_size, num_heads, seq_len, head_dim] into per-token [1, num_heads, 1, head_dim] tensors.

This

  • avoids variable-length stitching between partial KV cache entries, which might be error-prone. Instead, every traversal is token-by-token.
  • separates cache retrieval into a simple, data-structure-oriented class, clearly named KVCacheTrie.
  • is C++/Rust-reimplementation friendly due to the simpler data structure.
  • has a single retrieval call right before submitting a string query to the LLM, which allows transparent access to the cache. This is different from the way it appears to be done here, which requires a big _generate_from_context_with_kv_cache method.

Thoughts?
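
For concreteness, here is a minimal sketch of that idea; KVCacheTrie and its method names are hypothetical, it assumes the legacy [batch_size, num_heads, seq_len, head_dim] tensor format, and it tracks a single layer for brevity (a real cache would hold one (k, v) pair per layer per node):

```python
import torch

class KVCacheTrie:
    """Token-wise trie over per-token KV slices, stored in flat dicts."""

    def __init__(self) -> None:
        self._children: dict[tuple[int, int], int] = {}  # (node, token_id) -> child node
        self._kv: dict[int, tuple[torch.Tensor, torch.Tensor]] = {}  # node -> (k, v)
        self._next = 1  # node 0 is the root

    def insert(self, tokens: list[int], keys: torch.Tensor, values: torch.Tensor) -> None:
        # keys/values: [1, num_heads, seq_len, head_dim] with seq_len == len(tokens)
        node = 0
        for i, tok in enumerate(tokens):
            child = self._children.get((node, tok))
            if child is None:
                child = self._next
                self._next += 1
                self._children[(node, tok)] = child
                # One [1, num_heads, 1, head_dim] slice per token.
                self._kv[child] = (keys[:, :, i:i + 1, :], values[:, :, i:i + 1, :])
            node = child

    def longest_prefix(self, tokens: list[int]):
        """Return (n_matched, keys, values) for the longest cached prefix."""
        node, slices = 0, []
        for tok in tokens:
            child = self._children.get((node, tok))
            if child is None:
                break
            slices.append(self._kv[child])
            node = child
        if not slices:
            return 0, None, None
        ks = torch.cat([k for k, _ in slices], dim=2)
        vs = torch.cat([v for _, v in slices], dim=2)
        return len(slices), ks, vs
```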

@guicho271828 (Contributor) commented:

Also:

  • Allows sharing the cache when two components share a prefix.
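
A usage sketch continuing the hypothetical KVCacheTrie above, showing how prefix sharing falls out of the structure:

```python
import torch

num_heads, head_dim = 8, 64
k = torch.randn(1, num_heads, 4, head_dim)
v = torch.randn(1, num_heads, 4, head_dim)

trie = KVCacheTrie()
trie.insert([1, 2, 3, 4], k, v)                # component A's tokens
n, ks, vs = trie.longest_prefix([1, 2, 3, 9])  # component B shares [1, 2, 3]
assert n == 3 and ks.shape == (1, num_heads, 3, head_dim)
```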

@mergify (bot) commented Sep 3, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@nrfulton nrfulton changed the title KV Blocks feat: KV Blocks Sep 4, 2025
@nrfulton (Contributor, Author) commented Sep 4, 2025

I was wondering how it was implemented. The way I envisioned it is a token-wise trie represented as a flat dictionary that stores the legacy KV cache format token-by-token...

Thoughts?

Before, or in addition to, "rolling our own", we should first try to leverage existing tooling in the KV cache management ecosystem (lmcache seems closest to the sort of thing we need).

Review thread on this diff hunk:

```python
value: str | None,
meta: dict[str, Any] | None = None,
*,
cache: bool = False,
```
@nrfulton (Contributor, Author) commented:

per @HendrikStrobelt, change the name to use_cache or something like that.
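
A sketch of the suggested rename; the enclosing function isn't visible in this hunk, so its name and return annotation are placeholders:

```python
from typing import Any

def example_constructor(  # hypothetical name; not visible in the hunk
    value: str | None,
    meta: dict[str, Any] | None = None,
    *,
    use_cache: bool = False,  # renamed from `cache` per the suggestion
) -> None:
    ...
```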
