
Conversation

@PatchouliTIS

@PatchouliTIS PatchouliTIS commented Nov 21, 2025

Purpose

This PR builds on PR #24799, aiming to implement a GPU version of n-gram speculative decoding and make it compatible with the async scheduler.

Test Plan

  • Async Scheduler + NGram + Qwen3-1.7B
    Test config:
# dataset is CMU-DoG, which is an input-grounded dataset.
python3.12 -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--max-num-seqs 128 \
--max-model-len 2048 \
--model Qwen/Qwen3-1.7B \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype bfloat16  \
--enable-chunked-prefill \
--disable-log-requests \
--async-scheduling \
--speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'

Test Device: NVIDIA H20
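
For context, n-gram (prompt-lookup) drafting proposes draft tokens by finding an earlier occurrence of the sequence's trailing n-gram and replaying the tokens that followed it. The sketch below is a minimal CPU reference of that idea, assuming the semantics of the `prompt_lookup_min`/`prompt_lookup_max`/`num_speculative_tokens` options in the config above; it is not the PR's CUDA kernel.

```python
# Minimal CPU sketch of n-gram "prompt lookup" drafting (reference only,
# not the GPU kernel from this PR).
def ngram_propose(token_ids, lookup_min=2, lookup_max=2, num_spec=3):
    """Find the most recent earlier occurrence of the trailing n-gram and
    propose the tokens that followed it."""
    for n in range(lookup_max, lookup_min - 1, -1):
        if len(token_ids) < n + 1:
            continue
        tail = token_ids[-n:]
        # Scan right-to-left over earlier positions for a matching n-gram.
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == tail:
                follow = token_ids[start + n:start + n + num_spec]
                if follow:
                    return follow
    return []  # no match: fall back to normal decoding

print(ngram_propose([5, 6, 7, 8, 5, 6]))  # -> [7, 8, 5]
```

The GPU variant in this PR performs the same matching in parallel across the batch, which is why it keeps the token IDs resident in a device buffer.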

Test Result

Performance

num_prompts | async_ngram (tps) | sync_ngram (tps) | speedup
----------- | ----------------- | ---------------- | -------
2           | 466               | 357              | 30.5%
8           | 1378              | 988              | 39.4%
16          | 2082              | 1726             | 20.6%

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

hl475 and others added 4 commits November 24, 2025 10:58
…rs (vllm-project#29111)

Signed-off-by: Huamin Li <3ericli@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: PatchouliTaisa <patchychen@tencent.com>
PatchouliTaisa and others added 7 commits November 24, 2025 11:03
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@ZJY0516
Contributor

ZJY0516 commented Nov 27, 2025

cc @njhill

# Honors opt-outs such as CompilationMode.NONE or VLLM_DISABLE_COMPILE_CACHE.
disable_cache = not is_compile_cache_enabled(self.inductor_config)

# TODO(patchy): ngram gpu kernel will cause vllm torch compile cache errors.
Collaborator

Why? Can this be fixed?

Author

When I enabled torch compile for the ngram GPU kernel, the computational graph of the ngram operator would hit a precompiled graph cached for the main model, leading to mismatched results. Therefore, I directly disabled the compile cache here. I tested this locally, and disabling the cache had no impact on performance.

Collaborator

I assume disabling the compile cache would lead to longer startup time? I'm not an expert here but maybe it's possible to add an identifier to the compile cache to avoid extraneous cache hits?
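
The reviewer's suggestion could look roughly like the sketch below: fold an extra identifier into the compile-cache key so the ngram drafter's graph can never hit the main model's cached graph. This is a hypothetical illustration (the function name and factors are invented), not vLLM's actual cache-key code.

```python
import hashlib

# Hypothetical sketch: include a per-graph tag in the compile-cache key so
# the ngram drafter graph and the main model graph get distinct cache entries.
def compile_cache_key(base_factors: list[str], graph_tag: str) -> str:
    h = hashlib.sha256()
    for factor in base_factors:          # e.g. torch version, dtype, config hash
        h.update(factor.encode())
    h.update(graph_tag.encode())         # e.g. "main_model" vs "ngram_gpu_drafter"
    return h.hexdigest()[:16]

key_main = compile_cache_key(["torch-2.x", "bf16"], "main_model")
key_ngram = compile_cache_key(["torch-2.x", "bf16"], "ngram_gpu_drafter")
assert key_main != key_ngram  # no cross-graph cache hits
```

This would keep caching enabled (avoiding the longer startup time the reviewer mentions) while preventing the mismatched-graph cache hits described above.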

pin_memory=False,
)
self.token_ids_cpu = self.token_ids_cpu_tensor.numpy()
self.token_ids_gpu_tensor = torch.zeros(
Collaborator

This is a massive buffer, and can go up to 1GB of VRAM in normal use-cases. Is there anything that can be done about this?

Author

Both the ngram GPU computation and ngram input preparation benefit from this buffer, so I think it is worth maintaining. By the way, I'm confused about the size: with max_model_len set to 128K, it would take approximately 1600 max_num_seqs to reach 1 GB of VRAM. Is that a normal use case? Besides, users can tune max_num_seqs and max_model_len themselves.

Collaborator

  • Some models have larger max context length than 128K (Qwen3 has 256K, Llama4 has 1M+)
  • Deployments often have max_num_seqs between 512 and 2048. I would consider this to be a normal range.
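
The back-of-envelope arithmetic behind both sides of this exchange can be checked directly, assuming an int32 token buffer of shape `[max_num_seqs, max_model_len]` (4 bytes per token; the exact figure depends on the dtype actually used):

```python
# VRAM estimate for a [max_num_seqs, max_model_len] token-ID buffer,
# assuming int32 tokens (4 bytes each).
def buffer_gib(max_num_seqs: int, max_model_len: int, bytes_per_token: int = 4) -> float:
    return max_num_seqs * max_model_len * bytes_per_token / 2**30

print(buffer_gib(1600, 128 * 1024))  # -> 0.78125 GiB, the ~1 GB ballpark above
print(buffer_gib(2048, 256 * 1024))  # -> 2.0 GiB for a 256K-context deployment
```

Under these assumptions the reviewer's "normal range" (max_num_seqs 512-2048, 256K+ contexts) does push the buffer past 1 GiB, which supports the concern.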

all_token_ids = prompt_token_ids + req_state.output_token_ids
num_tokens = len(all_token_ids)
# Copy to GPU tensor
self.input_batch.token_ids_gpu_tensor[idx, :num_tokens].copy_(
Collaborator

It looks like this copy is copying from a device tensor, instead of a standard HtoD copy. Why is that?

),
non_blocking=True,
)
self.input_batch.num_tokens_no_spec_gpu[idx] = num_tokens
Collaborator

Can this logic be integrated into _update_states, where num_tokens_no_spec (cpu) is maintained? That seems like it would be cleaner than recomputing it twice and copying over here. Also, we would not want to maintain two pieces of the same logic.
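
The single-source-of-truth structure the reviewer suggests could be sketched as below. The class and attribute names are hypothetical stand-ins for the runner's state (with plain lists standing in for the CPU array and CUDA tensor); the point is that one update method writes both copies, so the value is never recomputed in a second code path.

```python
# Hypothetical sketch of maintaining num_tokens_no_spec in one place and
# mirroring it to the device copy, instead of recomputing it twice.
class BatchState:
    def __init__(self, max_num_seqs: int):
        self.num_tokens_no_spec_cpu = [0] * max_num_seqs
        self.num_tokens_no_spec_gpu = [0] * max_num_seqs  # stand-in for a CUDA tensor

    def update_num_tokens(self, idx: int, num_tokens: int) -> None:
        # Single source of truth: write once, mirror to the device copy.
        self.num_tokens_no_spec_cpu[idx] = num_tokens
        self.num_tokens_no_spec_gpu[idx] = num_tokens

state = BatchState(4)
state.update_num_tokens(2, 17)
```

In the real runner the mirror write would be an async HtoD copy batched with the other per-step transfers, but the ownership question is the same.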

for i, num_tokens in enumerate(num_accepted_tokens):
self.input_batch.num_accepted_tokens_cpu[i] = num_tokens

def _update_ngram_gpu_tensors(self, scheduler_output: "SchedulerOutput") -> None:
Collaborator

Is there any unique logic in here that is distinct from how we maintain token_ids_cpu_tensor and num_tokens_no_spec_cpu?

@support_torch_compile(
dynamic_arg_dims={
"num_tokens_no_spec": 0,
"token_ids_gpu": [0, 1],
Collaborator

What happens when both dims are marked as dynamic? Does it recompile for every combination of BS and SeqLen? If so, isn't that way too much compilation?

with set_forward_context(None, self.vllm_config):
_ = self.kernel(num_tokens, token_ids, combined_mask)

def _generate_dummy_data(
Collaborator

Why does the dummy run need actual synthetic data? Can't it just use random tokens? Isn't the point of the dummy run just to initialize/record the shapes and buffers?

combined_mask = (
sampled_flags
& valid_mask
& (num_tokens_no_spec < self.max_model_len)
Collaborator

How could this happen?
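
A pure-Python reading of the mask in the diff hunk above: a request participates in drafting only if it actually produced a sampled token this step, occupies a valid batch slot, and still has room left in the context window (the values below are hypothetical).

```python
# Pure-Python sketch of the combined_mask above (illustrative values).
MAX_MODEL_LEN = 2048

def draft_mask(sampled_flags, valid_mask, num_tokens_no_spec):
    return [
        s and v and (n < MAX_MODEL_LEN)
        for s, v, n in zip(sampled_flags, valid_mask, num_tokens_no_spec)
    ]

print(draft_mask([True, True, True], [True, True, False], [100, 2048, 50]))
# -> [True, False, False]: the second request has filled its context window
```

The length check answers the "how could this happen?" question in part: a request that has reached max_model_len has no room for draft tokens, so it must be masked out even if it sampled a token.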

combined_mask,
)

def prepare_next_token_ids_cpu(
Collaborator

Please don't duplicate functions like this. Refactor and reuse the code.

combined_mask,
)

def prepare_next_token_ids_cpu(
Collaborator

This function does not seem to be used.

Collaborator

@benchislett benchislett left a comment

My only major concern with this PR right now is code maintainability. Between this PR and the draft-model support, we are duplicating/reusing many components of the original EAGLE implementation in a naive manner. We should carefully structure the code here to implement these drafters more cleanly.
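
One hypothetical shape for the cleaner structure this review asks for: a shared drafter interface that EAGLE, draft-model, and ngram-GPU drafters all implement, so batch bookkeeping lives in one place instead of being copied per drafter. The names below are invented for illustration and are not vLLM's actual class hierarchy.

```python
from abc import ABC, abstractmethod

# Hypothetical common interface so speculative drafters share one contract
# instead of each duplicating pieces of the EAGLE implementation.
class Drafter(ABC):
    @abstractmethod
    def propose(self, token_ids: list[int], num_spec: int) -> list[int]:
        """Return up to num_spec draft tokens for one request."""

class NgramDrafter(Drafter):
    def propose(self, token_ids, num_spec):
        # Trivial placeholder logic: repeat the last token num_spec times.
        return token_ids[-1:] * num_spec if token_ids else []
```

With such a contract, the runner's prepare/update paths would call `propose` polymorphically and the per-drafter code would shrink to the proposal logic itself.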
