
Conversation

Contributor

@Ronald1995 Ronald1995 commented Sep 13, 2025

Purpose

PR #19970 implements async_scheduling, PR #23569 implements prepare_input overlap based on PR #19970, and PR #24539 refactors the logic of eagle spec decode so that it no longer relies on the CPU's sample_token_ids.

This PR is based on #24539 and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, we no longer copy draft token ids back to the scheduler; instead we cache them in gpu_model_runner and update input_ids directly from _draft_token_ids for the next execute_model step.

Because ngram and medusa currently rely on the CPU's sample_token_ids, they could be refactored in the future; for now, this PR only supports eagle spec decode with async_scheduling.
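To illustrate the idea, here is a minimal toy sketch in plain Python (not vLLM's real classes; `ToyModelRunner`, `execute_model`'s dict-based signature, and the request ids are all hypothetical): the runner keeps draft token ids on the worker side and splices them into the next step's input ids itself, instead of round-tripping them through the scheduler on the CPU.

```python
# Hypothetical sketch of caching draft tokens in the model runner.
# In the real PR this happens on the GPU with tensors; plain lists are
# used here only to show the control flow.

class ToyModelRunner:
    def __init__(self):
        # req_id -> draft token ids cached from the previous step
        self._draft_token_ids = {}

    def execute_model(self, scheduled):
        # scheduled: req_id -> token ids sampled in the last step
        input_ids = {}
        for req_id, sampled in scheduled.items():
            # Next-step input = last sampled token(s) + cached draft tokens,
            # with no copy of the draft tokens back to the scheduler.
            input_ids[req_id] = sampled + self._draft_token_ids.get(req_id, [])
        # Pretend the drafter proposed new draft tokens for the next step
        # (a real eagle drafter would run a forward pass here).
        self._draft_token_ids = {
            req_id: [tok + 1 for tok in ids] for req_id, ids in input_ids.items()
        }
        return input_ids
```

The point of the sketch is only the data flow: the scheduler never sees the draft tokens, so the CPU-side sample_token_ids round trip disappears.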

Test Plan

We run an end-to-end test:

  • async_scheduling + EAGLE-LLaMA3-Instruct-8B draft model, verifying that generation works correctly.

Test config:

# The dataset is prm800k; read the jsonl and build the prompts.
sampling_params = SamplingParams(temperature=0, max_tokens=1024)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_model_len=2048,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    async_scheduling=True,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
    seed=1234
)

test device: Nvidia A100

Test Result

performance

| num_prompts | async_scheduling (tps) | sync_scheduling (tps) | speedup |
|---|---|---|---|
| 24 | 2356 | 2314 | 1.8% |
| 48 | 3759 | 3539 | 6.2% |
| 96 | 5110 | 4770 | 7.1% |

precision

I compared the outputs of async_scheduling and sync_scheduling with speculative decoding; the outputs are identical, so async_scheduling does not introduce any precision problem.
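The precision check above can be sketched as a small helper that diffs the generated texts from the two runs (the runs themselves, two LLM instances differing only in `async_scheduling`, are omitted; `diff_outputs` is a hypothetical helper, not part of vLLM):

```python
def diff_outputs(async_outputs, sync_outputs):
    """Return the prompt indices where the two runs disagree.

    Both arguments are lists of generated texts, one entry per prompt,
    in the same prompt order. An empty result means the runs match.
    """
    assert len(async_outputs) == len(sync_outputs), "runs must cover the same prompts"
    return [
        i
        for i, (a, s) in enumerate(zip(async_outputs, sync_outputs))
        if a != s
    ]
```

With greedy sampling (temperature=0, fixed seed, as in the config above), an empty diff is the expected outcome.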


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Sep 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 13, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 2 times, most recently from f417e8f to b530bf3 Compare September 13, 2025 07:57
@mergify mergify bot removed the needs-rebase label Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 8172c2b to 163f9ab Compare September 13, 2025 09:42
@robertgshaw2-redhat robertgshaw2-redhat changed the title async_scheduling for sepc code [Core] Async Scheduling X Spec Decoding Compatibility Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 4466156 to f971753 Compare September 15, 2025 01:29
@mergify

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 18, 2025
@Ronald1995 Ronald1995 changed the title [Core] Async Scheduling X Spec Decoding Compatibility [WIP][Core] Async Scheduling X Spec Decoding Compatibility Sep 19, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 13773be to 337aab8 Compare September 20, 2025 11:51
@Ronald1995 Ronald1995 requested a review from ApostaC as a code owner September 20, 2025 11:51
@mergify mergify bot removed the needs-rebase label Sep 20, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 3 times, most recently from 3630428 to 3ad3c1b Compare September 21, 2025 09:20
@lhtin
Contributor

lhtin commented Nov 13, 2025

@lhtin I get your point; your internal approach differs significantly from this PR. In my opinion, you could open another PR to implement your approach. Of course, it depends on the opinions of Nick and Ben @benchislett @njhill

@Ronald1995 It's not related to our internal approach, nor does it mean that this implementation is bad. I really appreciate you submitting such a great PR.

What I'm saying is that with the current PR, there are gaps under certain circumstances. Does this align with the expected behavior?

@benchislett
Collaborator

benchislett commented Nov 13, 2025

@lhtin I see what you mean. This is a consequence of having to wait for the main model forward pass to finish before we can start prepare_inputs in the next step. Since we're only overlapping with the drafting here, if the drafting is quick then we can have a blocking sync.

This is a known compromise of this PR. I think it would add a bit of complexity to address this, but it wouldn't be a big change to the diff in the PR. I think we should continue with this proposal as-is and treat your proposed change as a potential performance improvement for future work.

@benchislett
Collaborator

@lhtin I also wonder if that approach would work for all attention backends. I know some attention backends require the metadata on both the CPU and GPU in order to run properly, and that might be a challenge to wrestle with the current abstractions.

Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 13, 2025
@njhill
Member

njhill commented Nov 13, 2025

@Ronald1995 @benchislett ok, I have pushed the test update and am running full CI now

Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill requested a review from 22quinn as a code owner November 14, 2025 00:32
Signed-off-by: Nick Hill <nhill@redhat.com>
@lhtin
Contributor

lhtin commented Nov 14, 2025

This is a known compromise of this PR. I think it would add a bit of complexity to address this, but it wouldn't be a big change to the diff in the PR. I think we should continue with this proposal as-is and treat your proposed change as a potential performance improvement for future work.

Understood, I completely agree to address this potential issue in a separate PR.

@lhtin I also wonder if that approach would work for all attention backends. I know some attention backends require the metadata on both the CPU and GPU in order to run properly, and that might be a challenge to wrestle with the current abstractions.

Indeed, this would add significant complexity. This aspect requires careful analysis to strike a balance between performance and maintainability.

@njhill
Member

njhill commented Nov 14, 2025

@Ronald1995 @benchislett unfortunately this test runs on an L4 which doesn't have enough vram to load the XiaomiMiMo/MiMo-7B-Base model in fp32, and is < sm_90 so the batch invariance feature doesn't work. One of these is needed to ensure stability of the outputs or else the correctness checks don't work.

I'm not sure of the best way to proceed... it would be really nice if there was a smaller test model we could use for the spec decoding cases!

@Ronald1995
Contributor Author

I'm not sure of the best way to proceed... it would be really nice if there was a smaller test model we could use for the spec decoding cases!

@njhill I tried to find a smaller model, but it's hard to find one. Maybe we could set max_num_seqs=1 to avoid the batch size's influence on the output, and use fewer prompts to speed up the test?

@njhill
Member

njhill commented Nov 14, 2025

I'm looking into getting a eagle model trained for llama-3.2-1B which we could use here. In the meantime I think we can merge this with the spec decode tests marked as skipped.

I have opened a separate PR with the test rework though which would be good to merge first, then can rebase this one: #28744

…spec_decode

# Conflicts:
#	tests/v1/e2e/test_async_scheduling.py
Signed-off-by: Nick Hill <nhill@redhat.com>
Member

@njhill njhill left a comment


Thanks again @Ronald1995 for all your work on this.

@mergify

mergify bot commented Nov 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 15, 2025
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@mergify mergify bot removed the needs-rebase label Nov 15, 2025
@njhill njhill merged commit d8874c6 into vllm-project:main Nov 17, 2025
46 checks passed
Victor49152 pushed a commit to Victor49152/vllm that referenced this pull request Nov 20, 2025
…4799)

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
…4799)

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1

7 participants