[Core] Async Scheduling X Spec Decoding Compatibility #24799
Conversation
Code Review
This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.
@Ronald1995 It's not related to our internal approach, nor does it mean that this implementation is bad. I really appreciate you submitting such a great PR. What I'm saying is that with the current PR, there are gaps under certain circumstances. Does this align with the expected behavior?
@lhtin I see what you mean. This is a consequence of having to wait for the main model forward pass to finish before we can start prepare_inputs in the next step. Since we're only overlapping with the drafting here, if the drafting is quick then we can have a blocking sync. This is a known compromise of this PR. I think it would add a bit of complexity to address this, but it wouldn't be a big change to the diff in the PR. I think we should continue with this proposal as-is and treat your proposed change as a potential performance improvement for future work.
@lhtin I also wonder if that approach would work for all attention backends. I know some attention backends require the metadata on both the CPU and GPU in order to run properly, and that might be a challenge to wrestle with the current abstractions.
@Ronald1995 @benchislett ok I have pushed the test update and am running full CI now
Understood, I completely agree to address this potential issue in a separate PR.
Indeed, this would add significant complexity. This aspect requires careful analysis to strike a balance between performance and maintainability.
@Ronald1995 @benchislett unfortunately this test runs on an L4 which doesn't have enough vram to load the model. I'm not sure of the best way to proceed... it would be really nice if there was a smaller test model we could use for the spec decoding cases!
@njhill I tried to find a smaller model, but it's hard to find one. Maybe we can set …
I'm looking into getting an eagle model trained for llama-3.2-1B which we could use here. In the meantime I think we can merge this with the spec decode tests marked as skipped. I have opened a separate PR with the test rework though which would be good to merge first, then we can rebase this one: #28744
njhill left a comment:
Thanks again @Ronald1995 for all your work on this.
Purpose
PR #19970 implements async_scheduling, and PR #23569 implements `prepare_input` overlap on top of it. PR #24539 refactors the eagle spec_decode logic so that it no longer relies on the CPU-side sampled token ids. This PR is based on #24539 and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, the draft token ids are no longer copied back to the scheduler; instead they are cached in gpu_model_runner, and the cached `_draft_token_ids` are used directly to update `input_ids` for the next `execute_model` step (a code sketch of this flow follows below). Because ngram and medusa still rely on the CPU-side sampled token ids, they could be refactored in the future; for now this PR only supports eagle spec_decode with async_scheduling.
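A minimal sketch of this flow, assuming hypothetical class and method names (they do not match the actual `gpu_model_runner` code): the drafter's output stays on the GPU and is concatenated with the newly sampled tokens to build the next step's `input_ids`, so no device-to-host copy of the draft tokens is needed.

```python
from typing import Optional

import torch


class DraftTokenCache:
    """Illustrative only: cache eagle draft token ids on the GPU between steps."""

    def __init__(self) -> None:
        # Shape [num_reqs, num_spec_tokens]; filled by the drafter each step.
        self._draft_token_ids: Optional[torch.Tensor] = None

    def cache(self, draft_token_ids: torch.Tensor) -> None:
        # Keep the tensor on the device; nothing is copied back to the scheduler.
        self._draft_token_ids = draft_token_ids

    def next_input_ids(self, sampled_token_ids: torch.Tensor) -> torch.Tensor:
        # The next step verifies the cached draft tokens right after the tokens
        # sampled in this step, assembled entirely on the GPU.
        if self._draft_token_ids is None:
            return sampled_token_ids
        return torch.cat([sampled_token_ids, self._draft_token_ids], dim=1)
```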
Test Plan
We run an end-to-end test comparing sync and async scheduling with speculative decoding enabled; a sketch of the setup follows the config below.
Test config:
Test device: NVIDIA A100
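A hedged sketch of the setup (model names are placeholders; the `async_scheduling` and `speculative_config` engine arguments follow the PRs referenced above, so exact spellings may differ):

```python
from vllm import LLM, SamplingParams

# Placeholders: substitute a target model plus a matching eagle draft head.
spec_config = {
    "method": "eagle",
    "model": "<eagle-draft-model>",
    "num_speculative_tokens": 3,
}
prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=64)

# Baseline: sync scheduling with speculative decoding.
llm_sync = LLM(model="<target-model>", speculative_config=spec_config)
outputs_sync = llm_sync.generate(prompts, params)

# This PR: the same speculative config with async scheduling enabled.
llm_async = LLM(model="<target-model>", speculative_config=spec_config,
                async_scheduling=True)
outputs_async = llm_async.generate(prompts, params)
```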
Test Result
Performance
Precision
I compared the outputs of async_scheduling and sync_scheduling with speculative decoding; the outputs are exactly the same, so async_scheduling does not introduce a precision problem. A minimal form of this check is sketched below.
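For reference, the equivalence check in the sketch above reduces to comparing the generated text per prompt (greedy sampling, single output per request):

```python
# Outputs should match exactly between sync and async scheduling.
for out_sync, out_async in zip(outputs_sync, outputs_async):
    assert out_sync.outputs[0].text == out_async.outputs[0].text
```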