[Core] Async Scheduling X Spec Decoding Compatibility #24799
Conversation
Code Review
This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.
@Ronald1995 It's not related to our internal approach, nor does it mean that this implementation is bad. I really appreciate you submitting such a great PR. What I'm saying is that with the current PR, there are gaps under certain circumstances. Does this align with the expected behavior?
@lhtin I see what you mean. This is a consequence of having to wait for the main model forward pass to finish before we can start prepare_inputs in the next step. Since we're only overlapping with the drafting here, if the drafting is quick then we can have a blocking sync. This is a known compromise of this PR. I think it would add a bit of complexity to address this, but it wouldn't be a big change to the diff in the PR. I think we should continue with this proposal as-is and treat your proposed change as a potential performance improvement for future work.
@lhtin I also wonder if that approach would work for all attention backends. I know some attention backends require the metadata on both the CPU and GPU in order to run properly, and that might be a challenge to wrestle with the current abstractions.
@Ronald1995 @benchislett ok I have pushed the test update and am running full CI now
Understood, I completely agree to address this potential issue in a separate PR.
Indeed, this would add significant complexity. This aspect requires careful analysis to strike a balance between performance and maintainability.
@Ronald1995 @benchislett unfortunately this test runs on an L4 which doesn't have enough vram to load the model. I'm not sure of the best way to proceed... it would be really nice if there was a smaller test model we could use for the spec decoding cases!
@njhill I tried to find a smaller model, but it's hard to find one. Maybe we can set …
I'm looking into getting an eagle model trained for llama-3.2-1B which we could use here. In the meantime I think we can merge this with the spec decode tests marked as skipped. I have opened a separate PR with the test rework though which would be good to merge first, then we can rebase this one: #28744
njhill left a comment:
Thanks again @Ronald1995 for all your work on this.
Purpose
PR #19970 implements async_scheduling, and PR #23569 implements `prepare_input` overlap on top of it. PR #24539 refactors the eagle spec_decode logic so that it no longer relies on the CPU-side sampled token ids. This PR is based on #24539 and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, the draft token ids are no longer copied back to the scheduler; instead they are cached in gpu_model_runner, and the cached `_draft_token_ids` are used directly to update `input_ids` for the next `execute_model` step (a code sketch of this flow follows below). Because ngram and medusa still rely on the CPU-side sampled token ids, they could be refactored in the future; for now this PR only supports eagle spec_decode with async_scheduling.
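A minimal sketch of this flow, assuming hypothetical class and method names (they do not match the actual `gpu_model_runner` code): the drafter's output stays on the GPU and is concatenated with the newly sampled tokens to build the next step's `input_ids`, so no device-to-host copy of the draft tokens is needed.

```python
from typing import Optional

import torch


class DraftTokenCache:
    """Illustrative only: cache eagle draft token ids on the GPU between steps."""

    def __init__(self) -> None:
        # Shape [num_reqs, num_spec_tokens]; filled by the drafter each step.
        self._draft_token_ids: Optional[torch.Tensor] = None

    def cache(self, draft_token_ids: torch.Tensor) -> None:
        # Keep the tensor on the device; nothing is copied back to the scheduler.
        self._draft_token_ids = draft_token_ids

    def next_input_ids(self, sampled_token_ids: torch.Tensor) -> torch.Tensor:
        # The next step verifies the cached draft tokens right after the tokens
        # sampled in this step, assembled entirely on the GPU.
        if self._draft_token_ids is None:
            return sampled_token_ids
        return torch.cat([sampled_token_ids, self._draft_token_ids], dim=1)
```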
Test Plan
We run an end-to-end test comparing sync and async scheduling with speculative decoding enabled; a sketch of the setup follows the config below.
Test config:
Test device: NVIDIA A100
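A hedged sketch of the setup (model names are placeholders; the `async_scheduling` and `speculative_config` engine arguments follow the PRs referenced above, so exact spellings may differ):

```python
from vllm import LLM, SamplingParams

# Placeholders: substitute a target model plus a matching eagle draft head.
spec_config = {
    "method": "eagle",
    "model": "<eagle-draft-model>",
    "num_speculative_tokens": 3,
}
prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=64)

# Baseline: sync scheduling with speculative decoding.
llm_sync = LLM(model="<target-model>", speculative_config=spec_config)
outputs_sync = llm_sync.generate(prompts, params)

# This PR: the same speculative config with async scheduling enabled.
llm_async = LLM(model="<target-model>", speculative_config=spec_config,
                async_scheduling=True)
outputs_async = llm_async.generate(prompts, params)
```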
Test Result
Performance
Precision
I compared the outputs of async_scheduling and sync_scheduling with speculative decoding; the outputs are exactly the same, so async_scheduling does not introduce a precision problem. A minimal form of this check is sketched below.
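For reference, the equivalence check in the sketch above reduces to comparing the generated text per prompt (greedy sampling, single output per request):

```python
# Outputs should match exactly between sync and async scheduling.
for out_sync, out_async in zip(outputs_sync, outputs_async):
    assert out_sync.outputs[0].text == out_async.outputs[0].text
```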