
Conversation


@ngxson ngxson commented Nov 12, 2025

Fix #16488

How it works:

```mermaid
sequenceDiagram
    participant User
    participant server_http_context
    participant server_http_req
    participant handler
    participant server_http_res

    User->>server_http_context: request
    server_http_context->>server_http_req: create request
    server_http_req->>handler: dispatch request
    handler->>server_http_res: create response

    loop for each result
        server_http_res->>server_http_context: response chunk
        server_http_context->>User: response chunk
        server_http_context->>server_http_res: next()
    end

    server_http_res->>server_http_context: terminate
    server_http_context->>User: close connection
```
  • Each endpoint handler returns a server_res_generator, which is a class derived from server_http_res
  • The server_res_generator operates in one of two modes: stream or non-stream
    • In non-stream mode, we simply return the data back to the user
    • In stream mode, we call server_res_generator::next() until it returns false; each call to next() yields a new chunk of data (see the sketch after this list)
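
A minimal C++ sketch of how these pieces could fit together. This is hypothetical, not the code in this PR: only server_http_res, server_res_generator, next(), and the stream/non-stream split come from the description above; every other name and member is invented for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch -- not the actual definitions from this PR.
struct server_http_res {
    bool        stream = false;   // stream vs. non-stream mode
    std::string data;             // current chunk (stream) or whole body (non-stream)
    virtual ~server_http_res() = default;
    // advance to the next chunk; returns false once the response is finished
    virtual bool next() = 0;
};

// each endpoint handler returns a server_res_generator (derived from server_http_res)
struct server_res_generator : server_http_res {
    std::vector<std::string> chunks;  // toy stand-in for the per-task results
    size_t                   pos = 0;
    bool next() override {
        if (pos >= chunks.size()) {
            return false;             // nothing left to send -> terminate
        }
        data = chunks[pos++];         // expose the next chunk to the HTTP layer
        return true;
    }
};

// how the HTTP context side could drive the response
void send_response(server_http_res & res) {
    if (!res.stream) {
        // non-stream mode: simply return the data back to the user
        // write_body(res.data);
        return;
    }
    // stream mode: call next() until it returns false, forwarding one chunk per call
    while (res.next()) {
        // write_chunk(res.data);
    }
    // terminate: close the connection
}
```

The point of this shape is that the transport layer only ever sees server_http_res, so the HTTP code does not need to know anything about the inference tasks that produce the chunks.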

TODO:

  • fix error handling
  • add exception handler at server_routes level

Testing:

  • passed the automated tests.sh
  • tested normal usage with the web UI (including multimodal input)
  • tested the web UI with concurrent requests and random interruptions

@ngxson ngxson force-pushed the xsn/split_http_server_context branch from 0594df9 to 45b2fe1 on November 12, 2025 at 17:53
@ngxson ngxson marked this pull request as ready for review November 13, 2025 10:21
@ngxson ngxson requested a review from ggerganov as a code owner November 13, 2025 10:21
ngxson commented Nov 13, 2025

No rush on reviewing this; I would appreciate it if you could do some testing on your side @ggerganov

In the next PR, I'll try to break server.cpp into smaller pieces; the rough plan is:

  • server-context.cpp
  • server-queue.cpp
  • server-task.cpp (containing the task, response, and queue)
  • server-common.cpp (everything else)

While working on this, I'm also thinking about re-using the server code in llama-cli (I made a demo here). The main benefit would be bringing the same web UI experience to the CLI, including multimodal support, conversation control (delete/regenerate messages), tool calls, etc. The old CLI could be moved to llama-completion, with its chat support removed. What do you think about this idea?

