[Core] NGram GPU Implementation compatible with Async Scheduler #29184
Conversation
cc @njhill
```python
# Honors opt-outs such as CompilationMode.NONE or VLLM_DISABLE_COMPILE_CACHE.
disable_cache = not is_compile_cache_enabled(self.inductor_config)
...
# TODO(patchy): ngram gpu kernel will cause vllm torch compile cache errors.
```
Why? Can this be fixed?
When I enabled torch compile in the ngram GPU kernel, the computational graph for the ngram operator would hit a precompiled graph cache belonging to the main model, leading to mismatched computational graph results. Therefore, I directly disabled the compile cache here. I tested this locally, and disabling the cache had no impact on performance.
I assume disabling the compile cache would lead to longer startup times? I'm not an expert here, but maybe it's possible to add an identifier to the compile cache to avoid extraneous cache hits?
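One possible direction, as a sketch only (this is not vLLM's actual compile-cache API; the `component` field and key layout here are assumptions): mix a component identifier into the hash that keys the cache, so graphs compiled for the ngram drafter can never collide with the main model's entries.

```python
import hashlib
import json


def compile_cache_key(graph_source: str, config: dict, component: str) -> str:
    """Hypothetical cache key that includes a component identifier
    (e.g. "main_model" vs "ngram_drafter") so graphs compiled for
    different components never hit each other's cache entries."""
    payload = json.dumps(
        {"source": graph_source, "config": config, "component": component},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# The same graph source and config under different components yield distinct keys.
k_main = compile_cache_key("def f(x): return x + 1", {"mode": "default"}, "main_model")
k_ngram = compile_cache_key("def f(x): return x + 1", {"mode": "default"}, "ngram_drafter")
```

With such a scheme the cache could stay enabled for both components, avoiding the startup-time cost of disabling it outright.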
```python
    pin_memory=False,
)
self.token_ids_cpu = self.token_ids_cpu_tensor.numpy()
self.token_ids_gpu_tensor = torch.zeros(
```
This is a massive buffer, and can go up to 1GB of VRAM in normal use-cases. Is there anything that can be done about this?
Both the ngram GPU computation and the input preparation for ngram benefit from this buffer, so I think it is worth maintaining. By the way, I'm confused about the size: with max_model_len set to 128k, it would take approximately 1600 max_num_seqs to reach 1GB of VRAM. Is that a normal use case? Besides, users can specify the values of max_num_seqs and max_model_len themselves.
- Some models have a larger max context length than 128K (Qwen3 has 256K, Llama4 has 1M+)
- Deployments often have max_num_seqs between 512 and 2048. I would consider this to be a normal range.
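Both estimates can be checked with back-of-the-envelope arithmetic (assuming 4-byte int32 token ids; the dtype is an assumption, not stated in the thread):

```python
def token_buffer_bytes(max_num_seqs: int, max_model_len: int, dtype_bytes: int = 4) -> int:
    # A [max_num_seqs, max_model_len] token-id tensor.
    return max_num_seqs * max_model_len * dtype_bytes


GiB = 1024 ** 3

# Author's scenario: 128k context, ~1600 sequences -> roughly 0.78 GiB with 4-byte ids.
print(token_buffer_bytes(1600, 128 * 1024) / GiB)

# Reviewer's scenario: a 1M-token context (Llama4-class) at 512 sequences -> 2.0 GiB.
print(token_buffer_bytes(512, 1024 * 1024) / GiB)
```

So under the reviewer's long-context assumptions the buffer exceeds 1 GiB even at the low end of the stated max_num_seqs range.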
```python
all_token_ids = prompt_token_ids + req_state.output_token_ids
num_tokens = len(all_token_ids)
# Copy to GPU tensor
self.input_batch.token_ids_gpu_tensor[idx, :num_tokens].copy_(
```
It looks like this copy is copying from a device tensor, instead of a standard HtoD copy. Why is that?
```python
    ),
    non_blocking=True,
)
self.input_batch.num_tokens_no_spec_gpu[idx] = num_tokens
```
Can this logic be integrated into _update_states, where num_tokens_no_spec (CPU) is maintained? That seems cleaner than recomputing it and copying it over here. Also, we would not want to maintain two copies of the same logic.
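A minimal sketch of the reviewer's suggestion (the class and method names are hypothetical, and plain lists stand in for the CPU tensor and its device mirror): maintain num_tokens_no_spec in one place and mirror it to the device view inside the same update path, so the logic lives once.

```python
class InputBatchSketch:
    """Toy stand-in for the input batch: a single update path maintains
    both the host-side value and its device mirror."""

    def __init__(self, max_num_seqs: int):
        self.num_tokens_no_spec_cpu = [0] * max_num_seqs  # host copy
        self.num_tokens_no_spec_gpu = [0] * max_num_seqs  # pretend device mirror

    def update_states(self, idx: int, num_tokens: int) -> None:
        # Single source of truth: compute once, mirror immediately.
        # In the real code the second line would be an async H2D copy.
        self.num_tokens_no_spec_cpu[idx] = num_tokens
        self.num_tokens_no_spec_gpu[idx] = num_tokens


batch = InputBatchSketch(max_num_seqs=4)
batch.update_states(idx=1, num_tokens=17)
```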
```python
for i, num_tokens in enumerate(num_accepted_tokens):
    self.input_batch.num_accepted_tokens_cpu[i] = num_tokens
...
def _update_ngram_gpu_tensors(self, scheduler_output: "SchedulerOutput") -> None:
```
Is there any unique logic in here that is distinct from how we maintain token_ids_cpu_tensor and num_tokens_no_spec_cpu?
```python
@support_torch_compile(
    dynamic_arg_dims={
        "num_tokens_no_spec": 0,
        "token_ids_gpu": [0, 1],
```
What happens when both dims are marked as dynamic? Does it recompile for every combination of BS and SeqLen? If so, isn't that way too much compilation?
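The recompilation concern can be illustrated without torch (a sketch; the power-of-two bucketing scheme is an assumption for illustration, not what this PR does): if a compiled artifact were cached per exact (batch_size, seq_len) pair, the cache would grow with the product of observed values, whereas padding each dimension to a bucket bounds the number of distinct specializations.

```python
def bucket(n: int) -> int:
    """Round up to the next power of two, so many shapes share one compilation."""
    b = 1
    while b < n:
        b *= 2
    return b


exact_cache, bucketed_cache = set(), set()
for batch_size in range(1, 65):
    for seq_len in range(1, 1025, 7):
        exact_cache.add((batch_size, seq_len))
        bucketed_cache.add((bucket(batch_size), bucket(seq_len)))

print(len(exact_cache))     # thousands of exact specializations
print(len(bucketed_cache))  # a few dozen buckets
```

In practice, marking a dim dynamic in torch.compile is meant to produce one symbolic-shape graph rather than one graph per value, but guard failures can still trigger recompiles, which is presumably what this question is probing.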
```python
with set_forward_context(None, self.vllm_config):
    _ = self.kernel(num_tokens, token_ids, combined_mask)
...
def _generate_dummy_data(
```
Why does the dummy run need actual synthetic data? Can't it just use random tokens? Isn't the point of the dummy run just to initialize/record the shapes and buffers?
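A sketch of what the reviewer is suggesting (the function name is hypothetical, and plain Python stands in for torch): if the dummy run only records shapes and buffers, uniformly random ids in [0, vocab_size) at the target shape should suffice, assuming the kernel's control flow does not depend on token values.

```python
import random


def generate_dummy_tokens(num_seqs: int, seq_len: int, vocab_size: int, seed: int = 0):
    """Random token ids at the target shape -- enough to record shapes and
    buffers in a dummy run, with no need for realistic synthetic data."""
    rng = random.Random(seed)
    return [[rng.randrange(vocab_size) for _ in range(seq_len)] for _ in range(num_seqs)]


dummy = generate_dummy_tokens(num_seqs=2, seq_len=8, vocab_size=32000)
```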
```python
combined_mask = (
    sampled_flags
    & valid_mask
    & (num_tokens_no_spec < self.max_model_len)
```
How could this happen?
```python
    combined_mask,
)
...
def prepare_next_token_ids_cpu(
```
Please don't duplicate functions like this. Refactor and reuse the code.
```python
    combined_mask,
)
...
def prepare_next_token_ids_cpu(
```
This function does not seem to be used.
benchislett left a comment:
My only major concern with this PR right now is maintainability. Between this PR and the draft-model support, we are duplicating and reusing many components of the original EAGLE implementation in a naive manner. We should carefully structure the code here to implement these drafters more cleanly.
Purpose
This PR is based on PR #24799 and aims to implement a GPU version of ngram speculative decoding and make it compatible with the Async Scheduler.
Test Plan
Test config:
Test Device: NVIDIA H20
Test Result
Performance
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.