fix: Fix test and benchmark for trtllm-gen prefill batch size 1 #1912
base: main
Conversation
benchmarks/routines/attention.py
```python
kv_cache = torch.cat([k_fp8, v_fp8], dim=1)

if batch_size == 1:
    # trtllm kernel requires max_q_len to be the same as the seqlen of the query when batch_size=1
```
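The hunk above is cut off after the comment. Purely as a sketch of the kind of adjustment it introduces (the variable names `qo_indptr` and `s_qo` are assumptions taken from the discussion below, not necessarily the benchmark's actual names):

```python
import torch

batch_size = 1
s_qo = 1024                                            # padded maximum query length
qo_indptr = torch.tensor([0, 100], dtype=torch.int32)  # cumulative query lengths

if batch_size == 1:
    # Old kernel requirement: max_q_len must equal the actual query length, which
    # for a single sequence is the last entry of the cumulative-length array.
    max_q_len = int(qo_indptr[-1])  # -> 100
else:
    max_q_len = s_qo                # a padded upper bound is fine for batch_size > 1
```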
Why could `qo_indptr[-1]` be different from `s_qo`? Is it because we want to be compatible with CUDA graphs and `s_qo` will always be the maximum length?
Short answer is yes.

Longer answer: in a batch_size > 1 situation, the CUDA graph containing `prefill.trtllm_batch_context_with_kv_cache()` can be reused with multiple sequence lengths, but not when batch_size == 1. For example:

- If batch_size is 3 and we have two batches with query lengths `[100, 200, 300]` and `[16, 500, 1024]`, we can set `s_qo=1024` when we construct the CUDA graph and use the same CUDA graph for the two batches.
- However, for batch_size = 1, where we have batches of query lengths `[100]` and `[1024]`, a CUDA graph must be constructed each time -- first with `s_qo=100` and second with `s_qo=1024`.

Not sure whether the above is a real concern at the framework level. Nevertheless, `s_qo` goes in as the `max_q_len` input argument, where it is the maximum sequence length for the query. We may at least want to consider whether the wording in the documentation is clear 😄
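To make the reuse pattern concrete, here is a minimal capture/replay sketch in plain PyTorch. The `run_prefill` body is only a stand-in for the trtllm prefill call, and the tensor names and shapes (`static_q`, `static_qo_indptr`) are made up for illustration:

```python
import torch

s_qo = 1024  # padded maximum query length; baked into the captured graph as max_q_len
static_q = torch.zeros(s_qo, 8, 64, device="cuda", dtype=torch.bfloat16)
static_qo_indptr = torch.zeros(4, dtype=torch.int32, device="cuda")  # cumulative query lengths

def run_prefill():
    # Stand-in for the real prefill call. The real call takes max_q_len as a Python
    # int, so its value (s_qo here) is fixed at capture time; per-sequence lengths
    # would be read on-device from static_qo_indptr by the kernel itself.
    return static_q.float().sum(dim=-1)

# Warm up once, then capture.
run_prefill()
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    out = run_prefill()

# Replay for a batch with query lengths [100, 200, 300]: only device-side inputs
# change, so the same graph also serves [16, 500, 1024]. With batch_size == 1 under
# the old behavior, a graph baked with max_q_len = 1024 could not serve a length-100 query.
static_qo_indptr.copy_(torch.tensor([0, 100, 300, 600], dtype=torch.int32, device="cuda"))
graph.replay()
```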
Hi @bkryu, does upgrading to the latest trtllm-gen fix the issue?

/bot run

[FAILED] Pipeline #36750562: 1/17 passed
📌 Description
The current PR fixes the test and benchmark code that hit IMAs (illegal memory accesses) when running trtllm-gen paged & ragged prefill with batch size 1; the issue is described in #1898.
Root cause of the issue:
`flashinfer.prefill.trtllm_ragged_attention_deepseek` and `flashinfer.prefill.trtllm_batch_context_with_kv_cache` both require `max_q_len` to match the length of the query when batch size is 1.

Updated PR:
The issue has been addressed on the kernel side, so the "`max_q_len` must match the length of the query when batch size is 1" requirement no longer applies. The current PR updates the trtllm-gen FMHA cubins to the latest version.
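For context on why the wrapper-side alternative described next was rejected, here is a minimal PyTorch-only illustration (no FlashInfer code) of the host synchronization involved in reading a sequence length off the device:

```python
import torch

# For batch_size == 1 the last entry of the cumulative query-length array is the
# query length itself, but it lives on the GPU.
cum_seq_lens_q = torch.tensor([0, 100], dtype=torch.int32, device="cuda")

# .item() copies the scalar to the host and blocks until the GPU work producing it
# has finished. Such a device-to-host sync is not permitted while a CUDA graph is
# being captured, which is why deriving max_q_len this way inside the API is not viable.
max_q_len = cum_seq_lens_q[-1].item()
print(max_q_len)  # 100
```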
Description of previous solution:
Updating `max_q_len` to `cum_seq_lens_q[-1].item()` within the `trtllm_ragged_attention_deepseek` or `trtllm_batch_context_with_kv_cache` functions is not a viable option, because the CPU-side synchronization breaks the deterministic, fully device-side execution required during CUDA graph capture. The workaround was therefore to update the test & benchmark code that calls the trtllm prefill functions, and to state clearly in the docstrings that when batch_size == 1, max_q_len must match the query size.

🔍 Related Issues
#1898
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes