
Conversation

@ngxson ngxson commented Nov 11, 2025

This PR adds a generator-based API for receiving task results. It aims to reduce the usage of callback functions, making the code look more "linear" and easier to follow.

This also allows returning the correct HTTP error code in the streaming case, ref: #16486 (comment)

Example:

server_response_generator gen(ctx_server);
{
    std::vector<server_task> tasks;
    // ... populate tasks ...
    gen.post_tasks(std::move(tasks));
}

// wait for the results
auto all_results = gen.wait_for_all(req.is_connection_closed);

// collect results
if (all_results.is_terminated) {
    return; // connection is closed
} else if (all_results.error) {
    res_error(res, all_results.error->to_json());
    return;
} else {
    for (auto & res : all_results.results) {
        GGML_ASSERT(dynamic_cast<server_task_result_embd*>(res.get()) != nullptr);
        responses.push_back(res->to_json());
    }
}
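
For the streaming case mentioned above, the intent is to block on the first result before any chunk is written, so a real HTTP status code can still be returned on error. A rough sketch of that flow, reusing the names from the example and assuming a blocking next() and an is_error() check (neither is shown in this description):

server_response_generator gen(ctx_server);
gen.post_tasks(std::move(tasks));

// block until the first result is available; no HTTP headers have been sent yet
auto first_result = gen.next(req.is_connection_closed); // next() assumed for illustration
if (first_result == nullptr) {
    return; // connection closed while waiting
}
if (first_result->is_error()) { // is_error() assumed for illustration
    res_error(res, first_result->to_json()); // proper error code, stream never opened
    return;
}
// only now switch to a chunked/streamed response for the remaining results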

ngxson commented Nov 11, 2025

Trying to address https://github.com/ggml-org/llama.cpp/pull/16486/files#r2419474810 in the meantime

Edit: resolved in 31b8b70


// next responses are streamed
json first_result_json = first_result->to_json();
const auto chunked_content_provider = [first_result_json, gen, oaicompat](size_t, httplib::DataSink & sink) mutable -> bool {
@ngxson ngxson (Collaborator Author) Nov 11, 2025

note: in the future, when we separate the HTTP implementation from the current code base, this chunked_content_provider callback pattern will disappear.

the goal is to make each server endpoint handler itself become a generator, which generates a JSON response each time the next() function is called
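
As a rough illustration (the type and method names here are made up, not actual code), such a per-endpoint generator could look like:

struct endpoint_response_generator {
    // fill `out` with the next JSON chunk; return false once the response is finished
    virtual bool next(json & out) = 0;
    virtual ~endpoint_response_generator() = default;
};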

@ngxson (Collaborator Author)

on second thought, since this chunked_content_provider lambda function is already a generator itself, we can just keep it and only change the return type.

the ultimate goal is to expose an API that allows writing code like this:

const auto handle_chat_completions = [&](const Request & req, Response & res) {
    auto body = json::parse(req.body);
    // ... do parsing stuff with body
    auto response = handle_completions_impl(...);
    if (response.stream) {
        // response is now a generator, call next() until it returns false
        res.set_stream(true);
        json chunk;
        while (response.next(chunk)) {
            res.write(chunk.dump());
        }
        res.end();
    } else {
        // non-stream, response is simple object
        res.set_content(response.data);
    }
};

ngxson commented Nov 12, 2025

I renamed "generator" to "reader", as the term "generator" is better suited to describing the interface between server_context and the HTTP layer.

In a follow-up PR, I'll separate all HTTP-related code into its own API. The idea is that server_context returns a response_generator to the HTTP layer, and the HTTP layer simply calls next() until there is no data left.

For now, this PR should be ready for review. No rush, but CC @ggerganov for visibility.

Comment on lines +5002 to +5004
// need to store the reader as a pointer, so that it won't be destroyed when the handle returns
// use shared_ptr as it's shared between the chunked_content_provider() and on_complete()
const auto rd = std::make_shared<server_response_reader>(ctx_server);

Member

Could there be a race condition? I ran the test suite with thread sanitizer and it didn't detect any, but still might be worth verifying.

@ngxson ngxson (Collaborator Author) Nov 12, 2025

I don't think there can be a race condition, as both chunked_content_provider and on_complete are run on the same thread.

Looking into the source code, on_complete is only called when the Response object is destroyed: https://github.com/yhirose/cpp-httplib/blob/1acf18876fdbd1a4f3ff91f43cf21495af357850/httplib.h#L824-L828

So it is safe to assume that chunked_content_provider is never called after that point (and also can't be called at the same time as on_complete, since content_provider_ is a member of Response). Can you think of any other case where this may cause a problem?
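
To make the lifetime concrete, here is a trimmed-down sketch of the pattern being discussed (the content type and lambda bodies are placeholders):

const auto rd = std::make_shared<server_response_reader>(ctx_server);

const auto chunked_content_provider = [rd](size_t, httplib::DataSink & sink) -> bool {
    // ... pull results from *rd and write them to `sink` ...
    return true; // keep the stream open
};

const auto on_complete = [rd](bool /*success*/) {
    // runs when the Response is destroyed; dropping this copy of `rd` releases the reader
};

// both lambdas hold a strong reference, so the reader stays alive after the handler returns
res.set_chunked_content_provider("text/event-stream", chunked_content_provider, on_complete);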

@ngxson (Collaborator Author)

Also, it's worth noting that the official example from httplib treats the data (aka rd) as a raw C pointer instead of a smart pointer. The data is destroyed in on_complete, so its destruction is tied to the Response.

Do you think it's preferable to use a C pointer here, like in the example?
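
For comparison, the raw-pointer variant along the lines of the httplib example would look roughly like this (again, just a sketch):

auto * data = new server_response_reader(ctx_server);
res.set_chunked_content_provider(
    "text/event-stream",
    [data](size_t, httplib::DataSink & sink) -> bool {
        // ... stream results from *data into `sink` ...
        return true;
    },
    [data](bool /*success*/) {
        delete data; // destruction is tied to the Response, as noted above
    });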

@ggerganov ggerganov (Member) Nov 12, 2025

> Can you think of any other case where this may cause a problem?

No, it should be OK.

> Do you think it's preferable to use a C pointer here, like in the example?

I think the current version is good.

@ngxson ngxson merged commit 00c9408 into ggml-org:master Nov 12, 2025
69 of 72 checks passed
basnijholt pushed a commit to basnijholt/llama.cpp that referenced this pull request Nov 16, 2025
…ml-org#17174)

* server: (refactor) implement generator-based API for task results

* improve

* moving some code

* fix "Response ended prematurely"

* add sink.done before return false

* rm redundant check

* rm unused var

* rename generator --> reader