feat: KV Blocks #111
base: main
And the way it fits into a model that uses apply_chat_template or any other parser/renderer. Note that there is still a bug: other substrings in the prompt may also "hit" on the cached contents. We don't expect this to happen often in practice, given how KV cache smashing should typically be used, but we still need to address it, e.g. by introducing sentinel values, indexing string machines, or something along those lines. This is a --no-verify commit because the point of this code is documentation.
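A minimal sketch of the substring problem and one possible reading of the "sentinel values" fix, assuming the cached block is located in the rendered prompt by character search. The toy render function is a hypothetical stand-in for a chat template (it is not apply_chat_template), and render, naive_spans, and sentinel_spans are illustrative names only. The sketch also assumes the template passes message content through verbatim, so the sentinels survive rendering.

```python
import uuid


def render(messages: list[dict[str, str]]) -> str:
    """Toy stand-in for a chat template (hypothetical, not apply_chat_template)."""
    return "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)


def naive_spans(rendered: str, block: str) -> tuple[int, int]:
    """Locate a cached block by plain substring search: the first hit wins."""
    start = rendered.find(block)
    return start, start + len(block)


def sentinel_spans(
    messages: list[dict[str, str]], cached_idx: int
) -> tuple[str, tuple[int, int]]:
    """Wrap the cached message in unique sentinels, render, then strip them.

    Returns the clean rendered prompt and the exact character span of the
    cached block, even when identical text appears elsewhere in the prompt.
    """
    # Random hex makes an accidental collision with real content very unlikely.
    open_s = f"\x00{uuid.uuid4().hex}["
    close_s = f"]{uuid.uuid4().hex}\x00"
    marked = [dict(m) for m in messages]  # copy so the caller's data is untouched
    marked[cached_idx]["content"] = open_s + marked[cached_idx]["content"] + close_s
    rendered = render(marked)
    start = rendered.index(open_s)
    rendered = rendered.replace(open_s, "", 1)
    end = rendered.index(close_s)  # measured after the opening sentinel is gone
    rendered = rendered.replace(close_s, "", 1)
    return rendered, (start, end)
```

With two messages whose content is the same string "DOC" and the second one cached, naive_spans returns the span of the first occurrence, while sentinel_spans returns the span of the second. The result is a character span; mapping it to token positions would still need something like a fast tokenizer's offset mapping.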
I was wondering how it was implemented. The way I envisioned it is a token-wise trie represented as a flat dictionary that stores the legacy KV cache format token by token, by splitting the third dimension of This
Thoughts?
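A minimal sketch of the flat-dictionary trie described in this comment, under stated assumptions: the legacy cache is taken to be one (key, value) pair per layer, each shaped [batch, num_heads, seq_len, head_dim], so the third dimension is the sequence axis. Nested lists stand in for tensors so the sketch runs without torch; FlatKVTrie, slice_seq, concat_seq, and longest_prefix are hypothetical names, not code from this PR.

```python
from __future__ import annotations

# Assumed legacy layout: one (key, value) pair per layer, each shaped
# [batch][num_heads][seq_len][head_dim]; axis 2 is the sequence axis.
Tensor4D = list[list[list[list[float]]]]
LegacyCache = tuple[tuple[Tensor4D, Tensor4D], ...]


def slice_seq(t: Tensor4D, start: int, stop: int) -> Tensor4D:
    """Nested-list equivalent of t[:, :, start:stop, :] in torch."""
    return [[head[start:stop] for head in batch] for batch in t]


def concat_seq(parts: list[Tensor4D]) -> Tensor4D:
    """Nested-list equivalent of torch.cat(parts, dim=2)."""
    out = [[list(head) for head in batch] for batch in parts[0]]
    for part in parts[1:]:
        for b, batch in enumerate(part):
            for h, head in enumerate(batch):
                out[b][h].extend(head)
    return out


class FlatKVTrie:
    """Token-wise trie as a flat dict: token-prefix tuple -> KV slice of its last token."""

    def __init__(self) -> None:
        self._nodes: dict[tuple[int, ...], LegacyCache] = {}

    def insert(self, tokens: list[int], cache: LegacyCache) -> None:
        # Split the cache along the sequence axis, one entry per token prefix.
        for t in range(len(tokens)):
            prefix = tuple(tokens[: t + 1])
            if prefix not in self._nodes:  # shared prefixes are stored once
                self._nodes[prefix] = tuple(
                    (slice_seq(k, t, t + 1), slice_seq(v, t, t + 1)) for k, v in cache
                )

    def longest_prefix(self, tokens: list[int]) -> tuple[int, LegacyCache | None]:
        # Walk down the trie until the next prefix is missing.
        n = 0
        while n < len(tokens) and tuple(tokens[: n + 1]) in self._nodes:
            n += 1
        if n == 0:
            return 0, None
        slices = [self._nodes[tuple(tokens[: i + 1])] for i in range(n)]
        # Reassemble a legacy-format cache covering only the matched prefix.
        return n, tuple(
            (
                concat_seq([s[layer][0] for s in slices]),
                concat_seq([s[layer][1] for s in slices]),
            )
            for layer in range(len(slices[0]))
        )
```

Because every lookup re-hashes a growing prefix tuple, this is quadratic in prompt length; a node-based trie or incremental prefix hashing would avoid that at the cost of a less "flat" structure.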
Also:
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Before (or in addition to) "rolling our own", we should try to leverage existing tools in the KV cache management ecosystem (lmcache seems closest to what we need).
    value: str | None,
    meta: dict[str, Any] | None = None,
    *,
    cache: bool = False,
Per @HendrikStrobelt, rename this to use_cache or something similar.
DRAFT
Adds HuggingFace._generate_from_context_with_kv_cache, re-introducing some code from our early hacking in May/June. (Consider this a sign we're finally getting back to the fun stuff.)
Stuff that still needs to be done:
- Component.parts() needs implementation and walks, similar to the old span default_formatter_walk.
- use_kv_caching flag to generate_from_context.
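One way the Component.parts() walk could look, as a hypothetical sketch only: the Component class here is a stand-in with a made-up use_cache field (named after the review suggestion), and walk is an illustrative function. It does not reproduce the old default_formatter_walk, whose details are not shown here. It flattens a component tree, depth-first and left to right, into (text, cacheable) leaf blocks, with a cached parent marking its whole subtree as cacheable.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in; the real Component.parts() is still to be implemented."""

    text: str = ""
    use_cache: bool = False  # hypothetical per-component caching flag
    children: list[Component] = field(default_factory=list)

    def parts(self) -> list[Component]:
        return self.children


def walk(component: Component, inherited: bool = False) -> Iterator[tuple[str, bool]]:
    """Depth-first, left-to-right walk yielding (text, cacheable) leaf blocks.

    A component marked use_cache makes every leaf beneath it cacheable.
    """
    cached = inherited or component.use_cache
    parts = component.parts()
    if not parts:  # leaf: emit its own text
        yield component.text, cached
        return
    for part in parts:
        yield from walk(part, cached)
```

A generate path gated by a flag like use_kv_caching could then prefill the cacheable blocks separately and render the rest normally; how those blocks are stitched back together is exactly the open design question in this PR.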