
Conversation

@bkryu bkryu commented Oct 10, 2025

📌 Description

This PR fixes the illegal memory accesses (IMAs) in the test and benchmark code when running trtllm-gen paged & ragged prefill with batch size 1; the issue is described in #1898.

Root cause of the issue: flashinfer.prefill.trtllm_ragged_attention_deepseek and flashinfer.prefill.trtllm_batch_context_with_kv_cache both required max_q_len to exactly match the query length when batch size is 1.

Updated PR:
The issue has been addressed on the kernel side, so max_q_len no longer needs to match the query length when batch size is 1.

This PR updates the trtllm-gen FMHA cubins to the latest version.

Description of the previous solution:
Updating max_q_len to cum_seq_lens_q[-1].item() within the trtllm_ragged_attention_deepseek or trtllm_batch_context_with_kv_cache functions is not a viable option, because the CPU-side synchronization breaks the deterministic, fully device-side execution required during CUDA graph capture. The workaround was therefore to update the test & benchmark code that calls the trtllm prefill functions, and to state clearly in the docstrings that when batch_size == 1, max_q_len must match the query length (a sketch of this caller-side workaround follows).
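For illustration, here is a minimal sketch of that caller-side workaround; the names q_lens and the commented-out call are assumptions for illustration, not the exact test code:

# Hypothetical caller-side handling of max_q_len. The value is computed on the
# host *before* any CUDA graph capture, so no device-to-host sync happens inside
# the captured region.
q_lens = [1024]                  # per-request query lengths, known on the host
batch_size = len(q_lens)

if batch_size == 1:
    # the pre-fix trtllm-gen kernels required an exact match here
    max_q_len = q_lens[0]
else:
    # an upper bound is fine, which is what lets one CUDA graph be reused
    max_q_len = max(q_lens)

# out = flashinfer.prefill.trtllm_batch_context_with_kv_cache(
#     ..., max_q_len=max_q_len, ...)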

🔍 Related Issues

#1898

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

@bkryu bkryu self-assigned this Oct 10, 2025
kv_cache = torch.cat([k_fp8, v_fp8], dim=1)

if batch_size == 1:
# trtllm kernel requires max_q_len to be the same as the seqlen of the query when batch_size=1
Collaborator

Why could qo_indptr[-1] be different from s_qo? Is it because we want to be compatible with CUDA graphs, so s_qo will always be the maximum length?

Collaborator Author

Short answer is yes.

Longer answer: In a batch_size > 1 situation, the CUDA graph containing prefill.trtllm_batch_context_with_kv_cache() can be reused across multiple sequence lengths, but not when batch_size == 1. For example:

  • If batch_size is 3 and we have two batches with query lengths [100, 200, 300] and [16, 500, 1024], we can set s_qo=1024 when we construct the CUDA graph and reuse the same graph for both batches.
  • However, for batch_size=1 with batches of query lengths [100] and [1024], a CUDA graph must be constructed each time: first with s_qo=100 and then with s_qo=1024.

I'm not sure whether the above is a real concern at the framework level. Nevertheless, s_qo is passed as the max_q_len input argument, which is the maximum query sequence length. We may at least want to consider whether the wording in the documentation is clear 😄
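For intuition, here is a minimal, self-contained sketch of why the host-side scalar cannot adapt inside a captured graph; attn_step is a hypothetical stand-in for the prefill kernel, not the FlashInfer API:

import torch

static_q = torch.zeros(1024, 128, device="cuda")  # max-sized buffer reused on replay

def attn_step(q, max_q_len):
    # stand-in for the prefill call; max_q_len is a plain Python int (host scalar)
    return q[:max_q_len].sum()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = attn_step(static_q, max_q_len=1024)  # 1024 is frozen into the graph

static_q[:100].normal_()  # a new request of length 100: tensor contents may change...
g.replay()                # ...but the captured max_q_len is still 1024 on replay

Because max_q_len is baked in at capture time, a batch_size > 1 graph captured with the maximum length can serve many batches, while the old batch_size == 1 behavior (exact match required) forced a fresh capture per length.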

@bkryu bkryu force-pushed the trtllm-attention-debug branch from 4dade1b to 197a7a0 on October 16, 2025 at 17:23

yzh119 commented Oct 16, 2025

Hi @bkryu, does upgrading to the latest trtllm-gen fix the issue?


bkryu commented Oct 16, 2025

Hi @bkryu, does upgrading to the latest trtllm-gen fix the issue?

Hi @yzh119, I'm currently checking. Upgrading to the latest trtllm-gen does fix the batch size 1 unit test, but I am seeing some errors in other places. I will verify what is happening before marking the PR as ready.


bkryu commented Oct 16, 2025

/bot run

@flashinfer-bot

GitLab MR !83 has been created, and the CI pipeline #36750562 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #36750562: 1/17 passed
