server: (refactor) implement generator-based API for task results #17174
Conversation
Trying to address https://github.com/ggml-org/llama.cpp/pull/16486/files#r2419474810 in the meantime.

Edit: resolved in 31b8b70
tools/server/server.cpp (outdated)

```cpp
// next responses are streamed
json first_result_json = first_result->to_json();
const auto chunked_content_provider = [first_result_json, gen, oaicompat](size_t, httplib::DataSink & sink) mutable -> bool {
```
Note: in the future, when we separate the HTTP implementation from the current code base, this chunked_content_provider callback pattern will disappear.

The goal is to make each server endpoint handler itself become a generator, which generates a JSON response each time next() is called.
On second thought, since this chunked_content_provider lambda is already a generator itself, we can keep it and only change the return type.

The ultimate goal is to expose an API that allows writing code like this:
```cpp
const auto handle_chat_completions = [&](const Request & req, Response & res) {
    auto body = json::parse(req.body);
    // ... do parsing stuff with body
    auto response = handle_completions_impl(...);
    if (response.stream) {
        // response is now a generator; call next() until it returns false
        res.set_stream(true);
        json chunk;
        while (response.next(chunk)) {
            res.write(chunk.dump());
        }
        res.end();
    } else {
        // non-stream: response is a simple object
        res.set_content(response.data);
    }
};
```
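For illustration, one possible shape for such a generator object (a hypothetical sketch, not the API of this PR): the handler returns an object that either holds a complete JSON payload or yields one JSON chunk per next() call.

```cpp
#include <functional>
#include <nlohmann/json.hpp>

using json = nlohmann::ordered_json;

// hypothetical generator-style response: holds either a complete JSON
// payload (non-streaming) or a function that yields one chunk per call
struct StreamingResponse {
    bool stream = false;
    json data;                                // used when stream == false
    std::function<bool(json & out)> next_fn;  // yields chunks when streaming

    // returns false when the stream is exhausted
    bool next(json & out) { return next_fn(out); }
};
```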
I rename "generator" to "reader" as the term "generator" is better to be used to describe the interface between In a follow-up PR, I'll separate all http-related code into its own API. The idea is that For now, this PR should be ready for review. No rush but CC @ggerganov for visibility. |
```cpp
// need to store the reader as a pointer, so that it won't be destroyed when the handler returns
// use shared_ptr as it's shared between the chunked_content_provider() and on_complete()
const auto rd = std::make_shared<server_response_reader>(ctx_server);
```
Could there be a race condition? I ran the test suite with thread sanitizer and it didn't detect any, but it still might be worth verifying.
I don't think there can be a race condition, as both chunked_content_provider and on_complete run on the same thread.

Looking into the source code, on_complete is only called when the Response object is destroyed: https://github.com/yhirose/cpp-httplib/blob/1acf18876fdbd1a4f3ff91f43cf21495af357850/httplib.h#L824-L828

So it is safe to assume that chunked_content_provider is never called after that point (and it also can't run at the same time as on_complete, since content_provider_ is a member of Response). Can you think of any other case where this may cause a problem?
Also, it's worth noting that the official httplib example treats the data (aka rd) as a raw C pointer instead of a smart pointer. The data is destroyed in on_complete, so its destruction is tied to the Response's lifetime.

Do you think it's preferable to use a raw pointer here, like in the example?
> Can you think of any other case where this may cause a problem?

No, it should be OK.

> Do you think it's preferable to use a raw pointer here, like in the example?

I think the current version is good.
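To make the lifetime discussion above concrete, here is a minimal sketch of the pattern (the `Reader` type is a hypothetical stand-in for `server_response_reader`): both the provider and the releaser capture the shared_ptr, and httplib invokes the releaser only once, when the Response is destroyed, at which point the provider can no longer run.

```cpp
#include <memory>
#include <string>
#include "httplib.h"

// hypothetical stand-in for server_response_reader (stub implementation)
struct Reader {
    int remaining = 3;
    // fills out with the next chunk; returns false when the stream is exhausted
    bool next(std::string & out) {
        if (remaining == 0) return false;
        out = "data: chunk " + std::to_string(remaining--) + "\n\n";
        return true;
    }
    void stop() { remaining = 0; }  // cancel any pending work
};

void register_stream_route(httplib::Server & svr) {
    svr.Get("/stream", [](const httplib::Request &, httplib::Response & res) {
        // shared_ptr keeps the reader alive after this handler returns
        auto rd = std::make_shared<Reader>();

        const auto provider = [rd](size_t, httplib::DataSink & sink) -> bool {
            std::string chunk;
            if (!rd->next(chunk)) {
                sink.done();  // signal the end of the chunked response
                return false;
            }
            return sink.write(chunk.data(), chunk.size());  // false aborts the stream
        };

        // called once, when the Response object is destroyed
        const auto on_complete = [rd](bool /*success*/) { rd->stop(); };

        res.set_chunked_content_provider("text/event-stream", provider, on_complete);
    });
}
```

Since both lambdas run on the response thread and the last shared_ptr reference dies with the Response, no lock is needed around the reader.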
server: (refactor) implement generator-based API for task results (ggml-org#17174)

* server: (refactor) implement generator-based API for task results
* improve
* moving some code
* fix "Response ended prematurely"
* add sink.done before return false
* rm redundant check
* rm unused var
* rename generator --> reader
This PR adds a generator-based API for receiving task results. It aims to reduce the use of callback functions, making the code more "linear" and easier to follow.

This also allows returning the correct HTTP error code in the streaming case, ref: #16486 (comment)
Example:
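(A sketch of the reader-based flow the description refers to; `ResponseReader`, `Result`, and the error-to-HTTP mapping below are illustrative stand-ins, not the exact API added by this PR.)

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// illustrative stand-ins; the real types live in tools/server
struct Result { bool error = false; std::string json_str; };

struct ResponseReader {                 // stand-in for server_response_reader
    std::vector<Result> pending;
    std::optional<Result> next() {      // pop the next finished result
        if (pending.empty()) return std::nullopt;
        Result r = pending.back();
        pending.pop_back();
        return r;
    }
};

int main() {
    ResponseReader rd{{{true,  "{\"error\":\"oops\"}"},
                       {false, "{\"content\":\"hi\"}"}}};
    // the handler reads results linearly instead of nesting callbacks;
    // an error result can be mapped to a proper HTTP status code
    while (auto result = rd.next()) {
        if (result->error) {
            std::cout << "HTTP 500: " << result->json_str << "\n";
            break;
        }
        std::cout << "chunk: " << result->json_str << "\n";
    }
}
```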