Conversation


@Egor-Krivov Egor-Krivov commented Oct 20, 2025

Closes #5257

I also started reporting GB/s to the database, because many of these benchmarks are memory bound.

@Egor-Krivov Egor-Krivov changed the title [vllm][benchmarks][WIP] Attention benchmark [benchmarks][vllm] Paged attention benchmark Nov 12, 2025
@Egor-Krivov Egor-Krivov marked this pull request as ready for review November 12, 2025 14:38
@etiotto etiotto requested review from anmyachev and vlad-penkin and removed request for vlad-penkin November 13, 2025 14:57
@Egor-Krivov Egor-Krivov requested a review from Copilot November 14, 2025 14:12

Copilot AI left a comment


Pull Request Overview

This PR adds a paged attention benchmark for the vLLM library, implementing both 2D and 3D unified attention kernels with tensor descriptor optimizations. The benchmark compares performance against PyTorch reference implementations and reports both throughput (GB/s) and compute (TFlops) metrics.

Key changes:

  • Implementation of unified attention benchmark with paged KV cache support
  • Enhanced memory bandwidth calculations accounting for actual token usage
  • Extended result transformation to report GB/s metrics alongside TFlops
  • CI/CD workflow updates to run the new benchmark
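
The GB/s metric for a paged-attention kernel is usually derived from the bytes actually moved (the queries, the K/V-cache entries for tokens really present, and the output) divided by kernel time, rather than from allocated cache capacity. A minimal sketch of that calculation, with a hypothetical `paged_attention_gbps` helper and made-up shapes that are not taken from the PR's code:

```python
def paged_attention_gbps(num_seqs: int, num_heads: int, head_dim: int,
                         total_cached_tokens: int, dtype_bytes: int,
                         seconds: float) -> float:
    """Effective bandwidth (GB/s) for one paged-attention decode step.

    Counts only bytes the kernel actually touches:
      - queries read: one vector per sequence per head,
      - K and V cache read: only the tokens actually cached (not capacity),
      - output written: same shape as the queries.
    """
    q_bytes = num_seqs * num_heads * head_dim * dtype_bytes
    kv_bytes = 2 * total_cached_tokens * num_heads * head_dim * dtype_bytes  # K and V
    out_bytes = q_bytes
    return (q_bytes + kv_bytes + out_bytes) / seconds * 1e-9


# Example: 2 decoding sequences, 8 heads, head_dim 64, 1000 cached tokens,
# fp16 (2 bytes), kernel time 1 ms.
print(paged_attention_gbps(2, 8, 64, 1000, 2, 1e-3))
```

Counting `total_cached_tokens` instead of the allocated page capacity is what "accounting for actual token usage" above refers to: a paged KV cache is allocated in fixed-size blocks, so capacity overstates the traffic of short sequences.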

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:

  • benchmarks/third_party/vllm/unified_attention_benchmark.py — new comprehensive benchmark for vLLM's unified attention with 2D/3D kernels, supporting various model configurations and attention features
  • benchmarks/third_party/vllm/transform_results.py — enhanced to handle non-integer parameter values and to report both TFlops and GB/s metrics
  • benchmarks/third_party/vllm/batched_moe_benchmark.py — improved memory bandwidth calculation to account for the experts and tokens actually activated
  • .github/workflows/third-party-benchmarks.yml — added the unified attention benchmark to the CI workflow and improved command formatting


@Egor-Krivov Egor-Krivov enabled auto-merge (squash) November 14, 2025 15:29
@Egor-Krivov Egor-Krivov changed the title [benchmarks][vllm] Paged attention benchmark [benchmarks][vllm] Unified Attention benchmark (paged attention) Nov 14, 2025
@Egor-Krivov Egor-Krivov merged commit 43d761e into main Nov 14, 2025
23 of 25 checks passed
@Egor-Krivov Egor-Krivov deleted the egor/vllm_attn branch November 14, 2025 16:13


Development

Merging this pull request closed the linked issue: [VLLM] Add benchmark for unified attention (#5257)

3 participants