fix: Allow externally managed contexts with LLamaEmbedder #1263
Conversation
Fixes SciSharp#1259 and potentially SciSharp#1247 with changes to how the caller manages the LLamaEmbedder.
@martindevans I need some help with re-instating the old code that used to reset the kv_cache values: https://github.com/bmazzarol-bunnings/LLamaSharp/blob/test/context-cost-24/LLama/LLamaEmbedder.cs#L73-L74. Or, if it is not required, then all good. It looked important.
Sorry for the delay on reviewing this. The only issue I see with the current approach is that the embedder uses …
@martindevans I have attempted something; let me know if this is too far outside what you expected. I have a simple sequence id manager that ensures a single embedding operation against the embedder instance has exclusive use of a sequence id within the range of the maximum sequence id. I then clear the underlying memory associated with the sequence id and return it for re-use once the embedding operation is complete. Questions: …
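A minimal sketch of what a sequence id manager along these lines might look like (illustrative only, assuming a fixed maximum number of sequences; this is not the PR's actual code):

```csharp
// Hypothetical sequence id manager: hands out exclusive sequence ids up to a fixed
// maximum, blocks callers when all ids are in use, and returns ids to the pool on release.
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class SequenceIdManager
{
    private readonly ConcurrentQueue<int> _available = new ConcurrentQueue<int>();
    private readonly SemaphoreSlim _slots;

    public SequenceIdManager(int maxSequences)
    {
        for (var i = 0; i < maxSequences; i++)
            _available.Enqueue(i);
        _slots = new SemaphoreSlim(maxSequences, maxSequences);
    }

    /// <summary>Waits for a free sequence id and takes exclusive ownership of it.</summary>
    public async Task<int> AcquireAsync(CancellationToken token = default)
    {
        await _slots.WaitAsync(token).ConfigureAwait(false);
        // The semaphore guarantees at least one id is queued at this point.
        _available.TryDequeue(out var id);
        return id;
    }

    /// <summary>Returns a sequence id to the pool after its memory has been cleared.</summary>
    public void Release(int id)
    {
        _available.Enqueue(id);
        _slots.Release();
    }
}
```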
Wow, that's more than I expected, in a good way though! Should the sequence manager perhaps be moved into the LLamaContext? That way LLamaContext can expose …
@martindevans I have integrated the sequence manager into the LLamaContext, and the sequence id is now returned through a dispose implementation. Let me know what you think. I still need to test the sequence memory usage scenarios to ensure non-sequential sequence id values can be used without issue.
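A minimal sketch of the dispose-based return described above (the lease type, the `RentSequenceIdAsync` usage, and the clear callback are assumptions for illustration, not the PR's actual API):

```csharp
// Hypothetical dispose-based lease: disposing clears the sequence's memory and
// hands the id back to the manager for re-use.
using System;

public readonly struct SequenceIdLease : IDisposable
{
    private readonly SequenceIdManager _manager;
    private readonly Action<int> _clearSequenceMemory;

    public int SequenceId { get; }

    public SequenceIdLease(SequenceIdManager manager, int sequenceId, Action<int> clearSequenceMemory)
    {
        _manager = manager;
        SequenceId = sequenceId;
        _clearSequenceMemory = clearSequenceMemory;
    }

    public void Dispose()
    {
        // Clear any state associated with this sequence before returning it to the pool.
        _clearSequenceMemory(SequenceId);
        _manager.Release(SequenceId);
    }
}

// Usage (hypothetical): the context hands out a lease, and `using` guarantees the id is returned.
// using (var lease = await context.RentSequenceIdAsync())
// {
//     // run the embedding pass against lease.SequenceId
// }
```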
@martindevans this approach will not scale. I will close this and later raise another issue outlining what I have landed on at work for making embeddings scale with this library. It might be obvious to those in the know with llama.cpp, but I had made some bad assumptions: all the multithreading is handled by llama.cpp, so there is nothing to do on the dotnet side. For a backend engineer, it was not what I was used to.

The general design is that you want to pack as many tokens into the batch per inference as you can, but you don't want to use more sequences than you have logical cores. You create a LLamaContext per inferencing run and set the BatchSize based on the following calculation:

BatchSize = ContextSize * MaxSeq

where

MaxSeq = Math.Min(maxProcessors, inputs.Count)

To make the embedding generator thread-safe, you place a channel in front of it and spawn a single thread to read off it and process the requests, which are already batched. We then re-batch the requests as long as they can be read synchronously (sketched below).

So the current LLamaEmbedder has issues (it's not batchy for a start), but I think it can be fixed with something like this design. The challenge is that the project targets .NET Standard 2.0, so Channel is not available without a dependency on the System.Threading.Channels package. I know BlockingCollection is there, but it is non-trivial to build that up to the same level of sophistication.
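A minimal sketch of the channel-fronted design described in this comment (illustrative only; it assumes a dependency on the System.Threading.Channels package is acceptable, and the actual LLamaContext creation and embedding call are left as placeholder comments):

```csharp
// Hypothetical channel-fronted embedding service: a single background reader drains
// the channel, greedily re-batching whatever can be read synchronously, then runs
// one inference per batch.
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class BatchingEmbeddingService
{
    private readonly Channel<EmbeddingRequest> _requests = Channel.CreateUnbounded<EmbeddingRequest>();
    private readonly int _maxProcessors;

    public BatchingEmbeddingService(int maxProcessors)
    {
        _maxProcessors = maxProcessors;
        _ = Task.Run(ProcessLoopAsync); // single consumer thread
    }

    public async Task<float[]> EmbedAsync(string input)
    {
        var request = new EmbeddingRequest(input);
        await _requests.Writer.WriteAsync(request);
        return await request.Completion.Task;
    }

    private async Task ProcessLoopAsync()
    {
        while (await _requests.Reader.WaitToReadAsync())
        {
            // Re-batch: take everything that can be read synchronously, capped at the core count.
            var batch = new List<EmbeddingRequest>();
            while (batch.Count < _maxProcessors && _requests.Reader.TryRead(out var request))
                batch.Add(request);

            // MaxSeq = min(logical cores, inputs in this batch); BatchSize = ContextSize * MaxSeq.
            var maxSeq = Math.Min(_maxProcessors, batch.Count);

            // Here a LLamaContext would be created for this run with BatchSize = ContextSize * maxSeq,
            // all sequences decoded in one batch, and each request completed with its embedding:
            foreach (var request in batch)
                request.Completion.SetResult(Array.Empty<float>()); // placeholder for real embeddings
        }
    }

    private sealed class EmbeddingRequest
    {
        public string Input { get; }
        public TaskCompletionSource<float[]> Completion { get; } =
            new TaskCompletionSource<float[]>(TaskCreationOptions.RunContinuationsAsynchronously);
        public EmbeddingRequest(string input) => Input = input;
    }
}
```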