8 changes: 5 additions & 3 deletions inference/trillium/vLLM/README.md
@@ -1,12 +1,14 @@
# Serve vLLM on Trillium TPUs (v6e)

Although vLLM TPU's [new unified backend](https://github.com/vllm-project/tpu-inference) is designed to make high-performance serving work out of the box for any model supported in vLLM, a few core components are still being implemented.

For this reason, we’ve provided a set of stress-tested recipes for deploying and serving vLLM on Trillium TPUs using Google Compute Engine (GCE).

- [Llama3.1-8B/70B](./Llama3.1/README.md)
- [Qwen2.5-32B](./Qwen2.5-32B/README.md)
- [Qwen2.5-VL-7B](./Qwen2.5-VL/README.md)
- [Qwen3-4B/32B](./Qwen3/README.md)
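
The recipes above cover end-to-end setup on a GCE TPU VM. As a rough illustration of what inference looks like once vLLM is installed on the VM, here is a minimal sketch using vLLM's offline Python API; the model name, `tensor_parallel_size`, and context length are placeholder assumptions rather than values taken from the recipes:

```python
# Minimal offline-inference sketch using vLLM's Python API on a TPU VM.
# The model name, parallelism, and sequence length below are illustrative
# placeholders, not values from the recipes above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # any supported model works
    tensor_parallel_size=4,             # match the number of TPU chips on the VM
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what a TPU is in one sentence."], params)
print(outputs[0].outputs[0].text)
```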

These models were chosen for demonstration purposes only; you can serve any model from the [vLLM Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) list.
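
Once a recipe's server is running, it exposes vLLM's OpenAI-compatible HTTP API. Below is a minimal client sketch, assuming the server listens on the default port 8000 and that the model name matches whatever was served; both are illustrative assumptions:

```python
# Minimal client sketch for a running vLLM server's OpenAI-compatible endpoint.
# Assumes the server is reachable on localhost:8000 (the default); the model
# name is an illustrative placeholder and should match the served model.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello from a TPU VM!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```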

If you are looking for a GKE-based deployment, see [Serve an LLM using TPU Trillium on GKE with vLLM](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-vllm-tpu).

Please consult the [Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features.html) page for a list of models and features that are validated through unit, integration, and performance testing.