[POC] Encoder Disaggregation #4047
Draft
This PR is not yet ready to be merged or fully reviewed.
However, the essential building blocks are implemented and pass a naive single-request test, so I am opening this draft PR for anyone interested.
Motivation
InternVL3.5 proposed Decoupled Vision Deployment (DvD), also known as Encode-Prefill-Decode Disaggregation (EPD) in many other papers. We use the term EPD in the following descriptions. This paradigm has the potential to improve Time-to-First-Token (TTFT) and throughput.
Design
In this section, we explain the design logic of EPD in lmdeploy.
Sequence Diagram
PD distserve attaches migration_request to the P instance response and routes it to the D instance. Similarly, we propose a new attribute, encoder_result, which is attached to the E instance response and routed to the PD instance.
State Diagram
To migrate features from the E instance to the PD instance, we add the relevant scheduling logic inside the PyTorch engine. Specifically, we treat E -> PD scheduling and migration as an extension of the current PD disaggregation, adding extra states such as WAITING_EPD_MIGRATION, RUNNING_EPD_MIGRATION, and EPD_MIGRATION_LOCKED.
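As a rough illustration of the extra states, here is a minimal sketch of a transition table. Only the three `*_EPD_*` state names come from this PR; the `EpdStatus` enum, the `WAITING` state, the `advance` helper, and the linear transition order are assumptions made for illustration and may not match the actual scheduler.

```python
import enum


class EpdStatus(enum.Enum):
    """Hypothetical status enum; only the three *_EPD_* names come from the PR."""
    WAITING_EPD_MIGRATION = enum.auto()  # features on E not yet being transferred
    RUNNING_EPD_MIGRATION = enum.auto()  # E -> PD feature transfer in flight
    EPD_MIGRATION_LOCKED = enum.auto()   # features received, held until released
    WAITING = enum.auto()                # assumed: ready for normal prefill scheduling


# Assumed linear transition order; the real scheduler may differ.
ALLOWED = {
    EpdStatus.WAITING_EPD_MIGRATION: {EpdStatus.RUNNING_EPD_MIGRATION},
    EpdStatus.RUNNING_EPD_MIGRATION: {EpdStatus.EPD_MIGRATION_LOCKED},
    EpdStatus.EPD_MIGRATION_LOCKED: {EpdStatus.WAITING},
    EpdStatus.WAITING: set(),
}


def advance(current, target):
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the allowed transitions in a single table makes it easy to compare against the state diagram and to reject out-of-order updates early.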
Modifications
Modifications are threefold:
-- New engine role 'Encoder'.
-- Proxy routing.
-- P2P connections/initializations.
-- A separate engine for the encoder.
-- A multimodal cache engine (credit to @FirwoodLin).
-- Accept results from the encoder side.
-- Schedule multimodal cache migration.
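The proxy routing described in the Design section can be sketched roughly as follows. This is not the code in this PR: apart from the attribute name `encoder_result`, every name here (`Request`, `EncoderResult`, its fields, and `route`) is hypothetical, and the E and PD instances are stand-in callables rather than real engine endpoints.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EncoderResult:
    """Hypothetical handle telling the PD instance where to fetch features."""
    remote_engine_id: str
    session_id: int
    num_image_tokens: int


@dataclass
class Request:
    """Hypothetical request shape used only for this sketch."""
    session_id: int
    prompt: str
    image_urls: List[str]
    encoder_result: Optional[EncoderResult] = None  # attribute name from the PR


def route(
    request: Request,
    encode: Callable[[Request], EncoderResult],
    prefill_decode: Callable[[Request], str],
) -> str:
    """Send multimodal requests through E first, then forward them to PD."""
    if request.image_urls and request.encoder_result is None:
        # E instance response carries encoder_result, mirroring how the
        # P instance response carries migration_request in PD distserve.
        request.encoder_result = encode(request)
    return prefill_decode(request)
```

Text-only requests skip the E instance entirely in this sketch, which is one plausible design choice rather than confirmed behavior.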
Performance
TODO
Tasks
-- Multimodal engine
-- Minimal modifications to the LLM engine
-- Proxy routing logic
-- Multi-batch
-- Metrics implementation
-- Performance test and optimizations
Related