@CUHKSZzxy CUHKSZzxy commented Oct 17, 2025

Not ready to be merged or fully reviewed yet.

However, since the essential building blocks are already implemented and pass my naive single-request test, I am opening a draft PR here for anyone interested.

Motivation

InternVL3.5 proposed Decoupled Vision Deployment (DvD), also known as Encode-Prefill-Decode Disaggregation (EPD) in many other papers. We use the term EPD in the following descriptions. This paradigm has the potential to improve Time-to-First-Token (TTFT) and throughput.

Design

In this section, we explain the design logic of EPD in lmdeploy.

  1. Entire workflow
<details>
<summary>Sequence Diagram (click to expand)</summary>

```mermaid
sequenceDiagram
    participant Client
    participant API Server (Proxy)
    participant AsyncEngine
    participant Encoder Engine
    participant LLM Engine
    participant P2P Connection (ZMQ)

    %% Phase 1: Encoder Processing
    Client->>+API Server (Proxy): POST /v1/chat/completions (prompt, image_data)
    API Server (Proxy)->>+AsyncEngine: generate(prompt, image_data)
    AsyncEngine->>+Encoder Engine: step(ADD_MESSAGE, image_data)
    Encoder Engine->>Encoder Engine: Process image, generate feature embeddings
    Note right of Encoder Engine: Features stored in its local cache blocks
    Encoder Engine-->>-AsyncEngine: InferOutput (encoder_result)
    Note over AsyncEngine, Encoder Engine: encoder_result contains feature block_ids

    %% Phase 2: Feature Cache Migration
    AsyncEngine->>+LLM Engine: step(ADD_MESSAGE, prompt, encoder_result)
    LLM Engine->>LLM Engine: _on_add_message() -> _add_message()
    Note right of LLM Engine: Sequence created, status = WAITING_EP_MIGRATION
    LLM Engine->>LLM Engine: _async_loop_ep_migration() -> schedule_ep_migration()
    Note right of LLM Engine: Sequence status -> RUNNING_EP_MIGRATION
    LLM Engine->>LLM Engine: executor.migrate(encoder_blocks -> llm_blocks)
    Note right of LLM Engine: Physical copy of feature embeddings
    LLM Engine->>+P2P Connection (ZMQ): zmq_send(ack)
    P2P Connection (ZMQ)->>+Encoder Engine: Receive ACK
    Encoder Engine->>Encoder Engine: Free its copy of the feature cache blocks
    deactivate P2P Connection (ZMQ)
    deactivate Encoder Engine

    Note right of LLM Engine: After migration, sequence status -> WAITING

    %% Phase 3: LLM Prefill & Decode
    AsyncEngine->>+LLM Engine: step(empty request to trigger prefill)
    LLM Engine->>LLM Engine: _async_loop_main() -> schedule(is_prefill=True)
    Note right of LLM Engine: Sequence status -> RUNNING
    LLM Engine->>LLM Engine: executor.forward(prompt + migrated_features)
    Note right of LLM Engine: GPU performs prefill, generates 1st token
    LLM Engine-->>-AsyncEngine: InferOutput (1st token)
    AsyncEngine-->>API Server (Proxy): Stream 1st token
    API Server (Proxy)-->>Client: Stream 1st token

    loop Decode Loop
        AsyncEngine->>+LLM Engine: step(empty request to trigger decode)
        LLM Engine->>LLM Engine: _async_loop_main() -> schedule(is_prefill=False)
        LLM Engine->>LLM Engine: executor.forward(last_token)
        LLM Engine-->>-AsyncEngine: InferOutput (next_token)
        AsyncEngine-->>API Server (Proxy): Stream next_token
        API Server (Proxy)-->>Client: Stream next_token
    end

    %% Phase 4: Finish
    Note over LLM Engine, AsyncEngine: Generation finishes (EOS/max_tokens)
    LLM Engine-->>AsyncEngine: InferOutput (finish=True)
    AsyncEngine->>+LLM Engine: step(END_SESSION)
    LLM Engine->>LLM Engine: _on_end_session() -> scheduler.end_session()
    Note right of LLM Engine: Frees all resources for the session
    deactivate LLM Engine
    AsyncEngine-->>API Server (Proxy): Close stream
    API Server (Proxy)-->>Client: Close connection
    deactivate API Server (Proxy)
    deactivate AsyncEngine
```

</details>

PD distserve attaches a migration_request to the P instance response and routes it to a D instance. Similarly, we propose a new attribute, encoder_result, attached to the E instance response and routed to the PD instance.
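As a rough sketch of this idea (names here are hypothetical illustrations, not the actual lmdeploy API), the encoder_result payload would carry the feature cache block ids and the p2p endpoint the PD instance needs to pull the embeddings and ACK back:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncoderResult:
    """Hypothetical payload attached to the E instance response.

    Mirrors how PD distserve attaches `migration_request` to the P
    instance response: the PD instance uses `block_ids` to migrate the
    cached feature embeddings over the ZMQ p2p connection, then sends
    an ACK so the encoder can free its copy of the blocks.
    """
    session_id: int
    block_ids: List[int]      # feature cache blocks held on the E instance
    zmq_endpoint: str         # p2p connection used for migration + ACK
    num_image_tokens: int = 0  # prompt positions the features will fill


def attach_encoder_result(response: dict, result: EncoderResult) -> dict:
    """Attach the encoder result before routing the request to a PD instance."""
    response = dict(response)
    response['encoder_result'] = result
    return response


resp = attach_encoder_result(
    {'prompt': 'describe the image'},
    EncoderResult(session_id=1, block_ids=[3, 4, 5],
                  zmq_endpoint='tcp://encoder-0:5555', num_image_tokens=256))
print(resp['encoder_result'].block_ids)  # [3, 4, 5]
```

The PD instance would treat a request carrying encoder_result analogously to one carrying migration_request, entering the EPD migration path described in the state diagram below.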

  2. State transitions
<details>
<summary>State Diagram (click to expand)</summary>

```mermaid
stateDiagram-v2
    direction LR

    [*] --> WAITING_EPD_MIGRATION: Request with encoder_result
    [*] --> WAITING_MIGRATION: Request with migration_request
    [*] --> WAITING: Standard Request

    state "Encoder-Prefill-Decode Path" as EPD_Path {
        WAITING_EPD_MIGRATION --> RUNNING_EPD_MIGRATION: Scheduler._schedule_epd_migration()
        RUNNING_EPD_MIGRATION --> EPD_MIGRATION_LOCKED: Engine locks after migration
        EPD_MIGRATION_LOCKED --> WAITING: Engine unlocks, ready for prefill
    }

    state "Prefill-Decode Path" as PD_Path {
        WAITING_MIGRATION --> RUNNING_MIGRATION: Scheduler._schedule_migration()
        RUNNING_MIGRATION --> MIGRATION_LOCKED: Engine locks after migration
        MIGRATION_LOCKED --> MIGRATION_DONE: Engine unlocks
        MIGRATION_DONE --> RUNNING: Scheduler.collect_migration_done()
    }

    state "Standard Inference Path" as Standard_Path {
        WAITING --> RUNNING: Scheduler._schedule_prefill()
        RUNNING --> LOCKED: Engine locks for forward pass
        LOCKED --> RUNNING: Engine unlocks after forward pass
        RUNNING --> WAITING: Evicted during decode scheduling
    }

    RUNNING --> ENDED: Generation finished (EOS/max_tokens)
    RUNNING --> STOPPED: User cancelled
    RUNNING --> ABORTED: Error (e.g., OOM)

    STOPPED --> ENDED
    ABORTED --> ENDED
    ENDED --> [*]
```

</details>

To migrate features from the E instance to the PD instance, we add the relevant scheduling logic inside the PyTorch engine. Specifically, we treat the E -> PD scheduling and migration as an extension of the current PD disaggregation, adding extra states such as WAITING_EPD_MIGRATION, RUNNING_EPD_MIGRATION, and EPD_MIGRATION_LOCKED.
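The EPD path in the state diagram can be sketched as a tiny state machine (a simplified illustration, not the engine's actual scheduler code): the scheduler picks up a waiting sequence, the engine migrates the feature blocks and locks it, then unlocks it back to WAITING so the standard prefill path takes over.

```python
from enum import Enum, auto


class MessageStatus(Enum):
    # Existing states (shown for context)
    WAITING = auto()
    RUNNING = auto()
    # New states proposed for the EPD migration path
    WAITING_EPD_MIGRATION = auto()
    RUNNING_EPD_MIGRATION = auto()
    EPD_MIGRATION_LOCKED = auto()


# Legal transitions on the EPD path, matching the state diagram above.
EPD_TRANSITIONS = {
    MessageStatus.WAITING_EPD_MIGRATION: MessageStatus.RUNNING_EPD_MIGRATION,
    MessageStatus.RUNNING_EPD_MIGRATION: MessageStatus.EPD_MIGRATION_LOCKED,
    MessageStatus.EPD_MIGRATION_LOCKED: MessageStatus.WAITING,
}


def step_epd(status: MessageStatus) -> MessageStatus:
    """Advance one EPD migration step; reject illegal transitions."""
    if status not in EPD_TRANSITIONS:
        raise ValueError(f'not an EPD migration state: {status}')
    return EPD_TRANSITIONS[status]


s = MessageStatus.WAITING_EPD_MIGRATION
s = step_epd(s)  # scheduler picked it up -> RUNNING_EPD_MIGRATION
s = step_epd(s)  # engine migrated + locked -> EPD_MIGRATION_LOCKED
s = step_epd(s)  # engine unlocked -> WAITING, standard prefill path resumes
print(s.name)  # WAITING
```

Reusing WAITING as the exit state is what lets the EPD path hand over to the unchanged prefill/decode scheduling.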

Modifications

Modifications are threefold:

  1. Proxy Router / EPD Connections. Credit to @FirwoodLin
    -- New engine role 'Encoder'.
    -- Proxy routing.
    -- P2P connections/initializations.
  2. Multimodal Engine
    -- A separate engine for the encoder.
    -- A multimodal cache engine. Credit to @FirwoodLin
  3. LLM Engine
    -- Accept results from the encoder side.
    -- Schedule multimodal cache migration.
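To illustrate the proxy-routing modification (a minimal sketch with hypothetical names and endpoints, not the actual proxy implementation): requests carrying image data are routed first to an instance with the new 'Encoder' role, then to a PD instance, while text-only requests skip the encoder hop.

```python
import itertools

# Hypothetical registry: engine role -> instance URLs. The real proxy
# tracks richer metadata (load, health, established p2p connections).
NODES = {
    'Encoder': ['http://encoder-0:23334', 'http://encoder-1:23334'],
    'PD': ['http://pd-0:23333', 'http://pd-1:23333'],
}
_round_robin = {role: itertools.cycle(urls) for role, urls in NODES.items()}


def pick(role: str) -> str:
    """Round-robin selection of an instance for the given engine role."""
    return next(_round_robin[role])


def route(request: dict) -> list:
    """Plan the hops for a request: Encoder -> PD for VL, PD only for text."""
    hops = []
    if request.get('image_data'):
        hops.append(('Encoder', pick('Encoder')))
    hops.append(('PD', pick('PD')))
    return hops


print(route({'prompt': 'hi'}))  # [('PD', 'http://pd-0:23333')]
print(route({'prompt': 'describe this', 'image_data': '<bytes>'}))
```

A real implementation would also carry the encoder_result from the first hop into the second request, as described in the Design section.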

Performance

TODO

Tasks

  • Extensive refactoring and fixing
    -- Multimodal engine
    -- Minimal modifications to the LLM engine
    -- Proxy routing logic
  • Accuracy and performance tests
    -- Multi-batch
    -- Metrics implementation
    -- Performance test and optimizations
  • Compatibility with turbomind
  • Compatibility with previous API server usage for VL models
  • Automatic p2p connection warmup
  • Disable LLM weight loading for MM engine
  • Extend EPD for QwenVL series and more

Related
