
Conversation

@ngxson ngxson commented Nov 11, 2025

This PR adds a generator-based API for receiving task results. It aims to reduce the usage of callback functions, making the code look more "linear" and easier to follow.

This also allows returning the correct HTTP error code in the streaming case, ref: #16486 (comment)

Example:

server_response_generator gen(ctx_server);
{
    std::vector<server_task> tasks;
    // ... populate tasks ...
    gen.post_tasks(std::move(tasks));
}

// wait for the results
auto all_results = gen.wait_for_all(req.is_connection_closed);

// collect results
if (all_results.is_terminated) {
    return; // connection is closed
} else if (all_results.error) {
    res_error(res, all_results.error->to_json());
    return;
} else {
    for (auto & res : all_results.results) {
        GGML_ASSERT(dynamic_cast<server_task_result_embd*>(res.get()) != nullptr);
        responses.push_back(res->to_json());
    }
}
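
For the streaming case mentioned above, the intent is to block on the first result before any chunk is written, so a real HTTP status code can still be returned on error. A rough sketch of that flow, reusing the names from the example and assuming a blocking next() and an is_error() check (neither is shown in this description):

server_response_generator gen(ctx_server);
gen.post_tasks(std::move(tasks));

// block until the first result is available; no HTTP headers have been sent yet
auto first_result = gen.next(req.is_connection_closed); // next() assumed for illustration
if (first_result == nullptr) {
    return; // connection closed while waiting
}
if (first_result->is_error()) { // is_error() assumed for illustration
    res_error(res, first_result->to_json()); // proper error code, stream never opened
    return;
}
// only now switch to a chunked/streamed response for the remaining results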

ngxson commented Nov 11, 2025

Trying to address https://github.com/ggml-org/llama.cpp/pull/16486/files#r2419474810 in the meantime

Edit: resolved in 31b8b70


// next responses are streamed
json first_result_json = first_result->to_json();
const auto chunked_content_provider = [first_result_json, gen, oaicompat](size_t, httplib::DataSink & sink) mutable -> bool {
@ngxson ngxson (Collaborator Author) Nov 11, 2025

note: in the future, when we separate the HTTP implementation from the current code base, this chunked_content_provider callback pattern will disappear.

the goal is to make each server endpoint handler itself become a generator, which generates a JSON response each time the next() function is called
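
As a rough illustration (the type and method names here are made up, not actual code), such a per-endpoint generator could look like:

struct endpoint_response_generator {
    // fill `out` with the next JSON chunk; return false once the response is finished
    virtual bool next(json & out) = 0;
    virtual ~endpoint_response_generator() = default;
};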

@ngxson (Collaborator Author)

on second thought, since this chunked_content_provider lambda function is already a generator itself, we can just keep it and only change the return type.

the ultimate goal is to expose an API that allows writing code like this:

const auto handle_chat_completions = [&](const Request & req, Response & res) {
    auto body = json::parse(req.body);
    // ... do parsing stuff with body
    auto response = handle_completions_impl(...);
    if (response.stream) {
        // response is now a generator, call next() until it returns false
        res.set_stream(true);
        json chunk;
        while (response.next(chunk)) {
            res.write(chunk.dump());
        }
        res.end();
    } else {
        // non-stream, response is simple object
        res.set_content(response.data);
    }
};

ngxson commented Nov 12, 2025

I renamed "generator" to "reader", as the term "generator" is better suited to describing the interface between server_context and the HTTP layer.

In a follow-up PR, I'll separate all HTTP-related code into its own API. The idea is that server_context returns a response_generator to the HTTP layer, and the HTTP layer simply calls next() until there is no data left.

For now, this PR should be ready for review. No rush, but CC @ggerganov for visibility.

Comment on lines +5002 to +5004
// need to store the reader as a pointer, so that it won't be destroyed when the handle returns
// use shared_ptr as it's shared between the chunked_content_provider() and on_complete()
const auto rd = std::make_shared<server_response_reader>(ctx_server);

Member

Could there be a race condition? I ran the test suite with thread sanitizer and it didn't detect any, but still might be worth verifying.

@ngxson ngxson (Collaborator Author) Nov 12, 2025

I don't think there can be a race condition, as both chunked_content_provider and on_complete are run on the same thread.

Looking into the source code, on_complete is only called when the Response object is destroyed: https://github.com/yhirose/cpp-httplib/blob/1acf18876fdbd1a4f3ff91f43cf21495af357850/httplib.h#L824-L828

So it is safe to assume that chunked_content_provider is never called after that point (and also can't be called at the same time as on_complete, since content_provider_ is a member of Response). Can you think of any other case where this may cause a problem?
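
To make the lifetime concrete, here is a trimmed-down sketch of the pattern being discussed (the content type and lambda bodies are placeholders):

const auto rd = std::make_shared<server_response_reader>(ctx_server);

const auto chunked_content_provider = [rd](size_t, httplib::DataSink & sink) -> bool {
    // ... pull results from *rd and write them to `sink` ...
    return true; // keep the stream open
};

const auto on_complete = [rd](bool /*success*/) {
    // runs when the Response is destroyed; dropping this copy of `rd` releases the reader
};

// both lambdas hold a strong reference, so the reader stays alive after the handler returns
res.set_chunked_content_provider("text/event-stream", chunked_content_provider, on_complete);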

@ngxson (Collaborator Author)

Also, it's worth noting that the official example from httplib treats the data (aka rd) as a raw C pointer instead of a smart pointer. The data is destroyed in on_complete, so its destruction is tied to the Response.

Do you think it's preferable to use a C pointer here, like in the example?
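
For comparison, the raw-pointer variant along the lines of the httplib example would look roughly like this (again, just a sketch):

auto * data = new server_response_reader(ctx_server);
res.set_chunked_content_provider(
    "text/event-stream",
    [data](size_t, httplib::DataSink & sink) -> bool {
        // ... stream results from *data into `sink` ...
        return true;
    },
    [data](bool /*success*/) {
        delete data; // destruction is tied to the Response, as noted above
    });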

@ggerganov ggerganov (Member) Nov 12, 2025

> Can you think of any other case where this may cause a problem?

No, it should be OK.

> Do you think it's preferable to use a C pointer here, like in the example?

I think the current version is good.

@ngxson ngxson merged commit 00c9408 into ggml-org:master Nov 12, 2025
69 of 72 checks passed
basnijholt pushed a commit to basnijholt/llama.cpp that referenced this pull request Nov 16, 2025
…ml-org#17174)

* server: (refactor) implement generator-based API for task results

* improve

* moving some code

* fix "Response ended prematurely"

* add sink.done before return false

* rm redundant check

* rm unused var

* rename generator --> reader