fix: Allow externally managed contexts with LLamaEmbedder #1263
Conversation
Fixes SciSharp#1259 and potentially SciSharp#1247 with changes to how the caller manages the LLamaEmbedder.
@martindevans I need some help with re-instating the old code that used to reset the kv_cache values: https://github.com/bmazzarol-bunnings/LLamaSharp/blob/test/context-cost-24/LLama/LLamaEmbedder.cs#L73-L74. Or, if it is not required, then all good. It looked important.
Sorry for the delay on reviewing this. The only issue I see with the current approach is that the embedder uses …
@martindevans I have attempted something; let me know if this is too far outside what you expected. I have a simple sequence id manager that ensures a single embedding operation against the embedder instance has exclusive use of a sequence id within the range of the maximum sequence id. I then clear the underlying memory associated with the sequence id and return it for re-use once the embedding operation is complete. Questions: …
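A minimal sketch of what a sequence id manager along these lines might look like (illustrative only, assuming a fixed maximum number of sequences; this is not the PR's actual code):

```csharp
// Hypothetical sequence id manager: hands out exclusive sequence ids up to a fixed
// maximum, blocks callers when all ids are in use, and returns ids to the pool on release.
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SequenceIdManager
{
    private readonly ConcurrentQueue<int> _available = new ConcurrentQueue<int>();
    private readonly SemaphoreSlim _slots;

    public SequenceIdManager(int maxSequences)
    {
        for (var i = 0; i < maxSequences; i++)
            _available.Enqueue(i);
        _slots = new SemaphoreSlim(maxSequences, maxSequences);
    }

    /// <summary>Waits for a free sequence id and takes exclusive ownership of it.</summary>
    public async Task<int> AcquireAsync(CancellationToken token = default)
    {
        await _slots.WaitAsync(token).ConfigureAwait(false);
        // The semaphore guarantees at least one id is queued at this point.
        _available.TryDequeue(out var id);
        return id;
    }

    /// <summary>Returns a sequence id to the pool after its memory has been cleared.</summary>
    public void Release(int id)
    {
        _available.Enqueue(id);
        _slots.Release();
    }
}
```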
Wow, that's more than I expected, in a good way though! Should the sequence manager perhaps be moved into the LLamaContext? That way LLamaContext can expose …
@martindevans I have integrated the sequence manager into the LLamaContext, and the sequence id is now returned through a dispose implementation. Let me know what you think. I still need to test the sequence memory usage scenarios to ensure non-sequential sequence id values can be used without issue.
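A minimal sketch of the dispose-based return described above (the lease type, the `RentSequenceIdAsync` usage, and the clear callback are assumptions for illustration, not the PR's actual API):

```csharp
// Hypothetical dispose-based lease: disposing clears the sequence's memory and
// hands the id back to the manager for re-use.
using System;

public readonly struct SequenceIdLease : IDisposable
{
    private readonly SequenceIdManager _manager;
    private readonly Action<int> _clearSequenceMemory;

    public int SequenceId { get; }

    public SequenceIdLease(SequenceIdManager manager, int sequenceId, Action<int> clearSequenceMemory)
    {
        _manager = manager;
        SequenceId = sequenceId;
        _clearSequenceMemory = clearSequenceMemory;
    }

    public void Dispose()
    {
        // Clear any state associated with this sequence before returning it to the pool.
        _clearSequenceMemory(SequenceId);
        _manager.Release(SequenceId);
    }
}

// Usage (hypothetical): the context hands out a lease, and `using` guarantees the id is returned.
// using (var lease = await context.RentSequenceIdAsync())
// {
//     // run the embedding pass against lease.SequenceId
// }
```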
@martindevans this approach will not scale. I will close this and later raise another issue outlining what I have landed on at work for making embeddings scale with this library. It might be obvious to those in the know with llama.cpp, but I had made some bad assumptions: all the multithreading is handled by llama.cpp, so there is nothing to do on the dotnet side. For a backend engineer, it was not what I was used to.

The general design is that you want to pack as many tokens into the batch per inference as you can, but you don't want to use more sequences than you have logical cores. You create a LLamaContext per inferencing run and set the BatchSize based on the following calculation:

BatchSize = ContextSize * MaxSeq

where

MaxSeq = Math.Min(maxProcessors, inputs.Count)

To make the embedding generator thread-safe, you place a channel in front of it and spawn a single thread to read off it and process the requests, which are already batched. We then re-batch the requests as long as they can be read synchronously (sketched below).

So the current LLamaEmbedder has issues (it's not batchy for a start), but I think it can be fixed with something like this design. The challenge is that the project targets .NET Standard 2.0, so Channel is not available without a dependency on the System.Threading.Channels package. I know BlockingCollection is there, but it is non-trivial to build that up to the same level of sophistication.
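A minimal sketch of the channel-fronted design described in this comment (illustrative only; it assumes a dependency on the System.Threading.Channels package is acceptable, and the actual LLamaContext creation and embedding call are left as placeholder comments):

```csharp
// Hypothetical channel-fronted embedding service: a single background reader drains
// the channel, greedily re-batching whatever can be read synchronously, then runs
// one inference per batch.
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BatchingEmbeddingService
{
    private readonly Channel<EmbeddingRequest> _requests = Channel.CreateUnbounded<EmbeddingRequest>();
    private readonly int _maxProcessors;

    public BatchingEmbeddingService(int maxProcessors)
    {
        _maxProcessors = maxProcessors;
        _ = Task.Run(ProcessLoopAsync); // single consumer thread
    }

    public async Task<float[]> EmbedAsync(string input)
    {
        var request = new EmbeddingRequest(input);
        await _requests.Writer.WriteAsync(request);
        return await request.Completion.Task;
    }

    private async Task ProcessLoopAsync()
    {
        while (await _requests.Reader.WaitToReadAsync())
        {
            // Re-batch: take everything that can be read synchronously, capped at the core count.
            var batch = new List<EmbeddingRequest>();
            while (batch.Count < _maxProcessors && _requests.Reader.TryRead(out var request))
                batch.Add(request);

            // MaxSeq = min(logical cores, inputs in this batch); BatchSize = ContextSize * MaxSeq.
            var maxSeq = Math.Min(_maxProcessors, batch.Count);

            // Here a LLamaContext would be created for this run with BatchSize = ContextSize * maxSeq,
            // all sequences decoded in one batch, and each request completed with its embedding:
            foreach (var request in batch)
                request.Completion.SetResult(Array.Empty<float>()); // placeholder for real embeddings
        }
    }

    private sealed class EmbeddingRequest
    {
        public string Input { get; }
        public TaskCompletionSource<float[]> Completion { get; } =
            new TaskCompletionSource<float[]>(TaskCreationOptions.RunContinuationsAsynchronously);
        public EmbeddingRequest(string input) => Input = input;
    }
}
```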