[benchmarks][vllm] Unified Attention benchmark (paged attention) #5348
Pull Request Overview
This PR adds a paged attention benchmark for the vLLM library, implementing both 2D and 3D unified attention kernels with tensor descriptor optimizations. The benchmark compares performance against PyTorch reference implementations and reports both throughput (GB/s) and compute (TFlops) metrics.
Key changes:
- Implementation of unified attention benchmark with paged KV cache support
- Enhanced memory bandwidth calculations accounting for actual token usage
- Extended result transformation to report GB/s metrics alongside TFlops
- CI/CD workflow updates to run the new benchmark
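As a rough illustration of what "accounting for actual token usage" can mean for an attention benchmark, the sketch below derives bytes moved and FLOPs from per-sequence lengths rather than from the padded cache size, then converts them to GB/s and TFLOPS. It is not the code from this PR: `AttnShape`, `attention_traffic_bytes`, `attention_flops`, `gbps`, and `tflops` are hypothetical names, and the model assumes each Q, K, V, and output element crosses memory exactly once and ignores causal masking.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AttnShape:
    """Hypothetical container for the attention dimensions used below."""
    num_query_heads: int
    num_kv_heads: int
    head_size: int
    dtype_bytes: int = 2  # assumption: fp16/bf16 elements


def attention_traffic_bytes(shape: AttnShape,
                            query_lens: Sequence[int],
                            kv_lens: Sequence[int]) -> int:
    """Bytes moved, counting only tokens that are actually used.

    Assumption: Q is read once, the output is written once, and each
    K/V entry of every sequence is read once from the paged cache.
    """
    q_tokens = sum(query_lens)
    kv_tokens = sum(kv_lens)
    q_and_out = 2 * q_tokens * shape.num_query_heads * shape.head_size
    k_and_v = 2 * kv_tokens * shape.num_kv_heads * shape.head_size
    return (q_and_out + k_and_v) * shape.dtype_bytes


def attention_flops(shape: AttnShape,
                    query_lens: Sequence[int],
                    kv_lens: Sequence[int]) -> int:
    """FLOPs for QK^T plus PV, 2 FLOPs per multiply-add.

    Assumption: no causal masking, so every query attends to every
    key of its own sequence.
    """
    pairs = sum(q * k for q, k in zip(query_lens, kv_lens))
    return 4 * pairs * shape.num_query_heads * shape.head_size


def gbps(num_bytes: float, elapsed_ms: float) -> float:
    """Memory throughput in GB/s for a kernel time given in milliseconds."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9


def tflops(flops: float, elapsed_ms: float) -> float:
    """Compute throughput in TFLOPS for a kernel time given in milliseconds."""
    return flops / (elapsed_ms * 1e-3) / 1e12
```

With a decode-style batch such as `query_lens=[1, 1]` and `kv_lens=[100, 28]`, K/V reads dominate the byte count, which is why the reported GB/s figure changes noticeably once padded cache blocks are excluded.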
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| benchmarks/third_party/vllm/unified_attention_benchmark.py | New comprehensive benchmark for vLLM's unified attention with 2D/3D kernels, supporting various model configurations and attention features |
| benchmarks/third_party/vllm/transform_results.py | Enhanced to handle non-integer parameter values and report both TFlops and GB/s metrics |
| benchmarks/third_party/vllm/batched_moe_benchmark.py | Improved memory bandwidth calculation to account for actual activated experts and token usage |
| .github/workflows/third-party-benchmarks.yml | Added unified attention benchmark to CI workflow and improved command formatting |
Closes #5257
I also started reporting GB/s to the database, because many of the benchmarks are memory-bound.
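One way to check whether a given benchmark is memory-bound is to compare its arithmetic intensity (FLOPs per byte moved) against the machine balance of the target device. The sketch below is illustrative only: the function names are hypothetical, and the peak numbers passed in are placeholders supplied by the caller, not measurements of any real GPU.

```python
def arithmetic_intensity(flops: float, num_bytes: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / num_bytes


def machine_balance(peak_tflops: float, peak_gbps: float) -> float:
    """FLOPs the device can perform per byte it can move at peak rates."""
    return (peak_tflops * 1e12) / (peak_gbps * 1e9)


def is_memory_bound(flops: float, num_bytes: float,
                    peak_tflops: float, peak_gbps: float) -> bool:
    """Simple roofline test: below the machine balance, bandwidth is the limit."""
    return arithmetic_intensity(flops, num_bytes) < machine_balance(
        peak_tflops, peak_gbps)
```

For kernels that fall on the memory-bound side of this test, GB/s is the more informative number to track over time, since TFLOPS will stay far below the device peak regardless of how well the kernel is tuned.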