[Bugfix] Spec decode + structured output + spec model max len edge case #28298
Conversation
Signed-off-by: Andy Lo <andy@mistral.ai>
Code Review
This pull request provides a solid fix for a critical bug involving speculative decoding and structured outputs, where stale speculative tokens could cause an assertion failure. The main change in vllm/v1/core/sched/scheduler.py to explicitly clear request.spec_token_ids after each scheduling step is correct and directly addresses the root cause. The subsequent simplification in update_draft_token_ids is a good cleanup that improves code clarity. Furthermore, the new test case in tests/v1/spec_decode/test_max_len.py is well-constructed to reproduce the specific edge case and prevent future regressions. The changes are of high quality and I see no issues.
njhill
left a comment
Thanks a lot @andylolu2 for this fix! And for updating the test.
Does your test update now exercise the `input_fits_in_drafter == False` case? If so, that's great; I had just noticed yesterday that no test was added when that check was put in.
@njhill I've updated the PR to address the comments, thanks for the review! Yes, the test now asserts that prompt_len is less than draft_max_len and that prompt_len + output_len is greater than draft_max_len, so it triggers the `input_fits_in_drafter == False` path.
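For illustration only, a minimal sketch of the boundary conditions described above (the names below are hypothetical, not the identifiers actually used in tests/v1/spec_decode/test_max_len.py):

```python
# Hypothetical values chosen so that the prompt alone fits in the drafter but
# the prompt plus generated output does not, which is exactly the edge case.
draft_max_len = 128   # max model length of the speculative (draft) model
prompt_len = 100      # prompt fits in the drafter at the start of generation
output_len = 64       # but prompt + output eventually exceeds draft_max_len

assert prompt_len < draft_max_len
assert prompt_len + output_len > draft_max_len
# Once the running length crosses draft_max_len, input_fits_in_drafter becomes
# False for the request, so no new draft tokens are proposed for it.
```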
njhill
left a comment
Thanks again @andylolu2
Purpose
Fixes #27210
Error:
Cause:
The current implementation relies on `scheduler.update_draft_token_ids` being called to reset previously-used `spec_token_ids`. However, `scheduler.update_draft_token_ids` is not guaranteed to be called after every step (currently it is skipped when model execution does not produce any new draft tokens, e.g. when `input_fits_in_drafter` is False). This results in stale `spec_token_ids` that don't necessarily follow the grammar after the output tokens from the current step have been accepted, leading to the assertion error above.

Proposed fix: Clear the spec token ids of a request immediately once it has been scheduled.
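A minimal sketch of this idea (the `Request` stand-in and `schedule_step` helper below are illustrative assumptions, not the actual code in vllm/v1/core/sched/scheduler.py):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    # Hypothetical stand-in for the scheduler's request object; only the field
    # relevant to this fix is shown.
    spec_token_ids: list[int] = field(default_factory=list)

def schedule_step(scheduled_requests: list[Request]) -> None:
    """Clear speculative tokens as soon as a request has been scheduled."""
    for request in scheduled_requests:
        # ... existing per-request scheduling work would happen here ...
        # Drop draft tokens carried over from the previous step unconditionally,
        # so stale drafts can never be checked against an updated grammar state.
        request.spec_token_ids = []
```

Under this scheme, `update_draft_token_ids` only needs to append fresh draft tokens and no longer has to reset leftovers from earlier steps, which is what allows the simplification mentioned in the review.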
Test Plan
Extended current tests to cover this edge case.
Test Result
No more assertion errors.