I deployed the Qwen2.5-VL-7B-Instruct model with both `vllm serve` and the Triton Inference Server vLLM backend. Using the same model, GPU resources, dataset, and stress-test parameters, I got noticeably different performance results.
vLLM engine log: `[core.py:61] Initializing a V1 LLM engine (v0.9.0.pre1+1958ee56.nv25.06)`
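Judging by the durations (roughly total_requests / QPS at low rates), the load generator is open loop: requests are issued at a fixed QPS regardless of whether earlier requests have finished, and the duration is the time until the last streamed response completes. Below is a minimal sketch of such a client against the vLLM serve OpenAI-compatible endpoint; the URL, port, model path, `max_tokens`, and the fixed text-only prompt are assumptions, and the real benchmark replays ShareGPT4V samples (including images).

```python
# Sketch of a fixed-rate (open-loop) streaming load generator against the
# vLLM serve OpenAI-compatible endpoint. URL, model path, max_tokens, and the
# text-only prompt are assumptions; the real test replays ShareGPT4V samples.
import asyncio
import time

import httpx

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"


async def one_request(client: httpx.AsyncClient, prompt: str) -> None:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,          # all tests use streaming responses
        "max_tokens": 256,
    }
    async with client.stream("POST", VLLM_URL, json=payload) as resp:
        async for _ in resp.aiter_lines():
            pass                 # consume SSE chunks; record per-token times here


async def run(qps: float, total_requests: int) -> float:
    """Issue requests at a fixed rate and return the time to finish them all."""
    async with httpx.AsyncClient(timeout=None) as client:
        start = time.perf_counter()
        tasks = []
        for i in range(total_requests):
            tasks.append(asyncio.create_task(one_request(client, f"prompt {i}")))
            await asyncio.sleep(1.0 / qps)   # open loop: pace by QPS, not by completion
        await asyncio.gather(*tasks)
        return time.perf_counter() - start


if __name__ == "__main__":
    print(f"duration: {asyncio.run(run(qps=1.0, total_requests=1000)):.2f}s")
```

Measured results for both deployments follow.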
vLLM serve:
Engine | GPU | Model Name | Dataset | Total requests | Total Input/Output Length | QPS | Duration(s) |
---|---|---|---|---|---|---|---|
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159397 | 0.1 | 10044.81 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159250 | 0.2 | 5021.97 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158921 | 0.4 | 2511.06 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158649 | 0.8 | 1255.53 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159001 | 1 | 1004.49 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159208 | 2 | 502.66 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160258 | 4 | 252.25 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160102 | 8 | 127.35 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158758 | 10 | 102.37 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1 | 381/126 | 1 | 0.9 |
Triton Inference Server (vLLM backend):
Server | Engine | GPU | Model Name | Dataset | Total requests | Total Input/Output Length | QPS | Duration(s) |
---|---|---|---|---|---|---|---|---|
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159927 | 0.1 | 10043.13 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159866 | 0.2 | 5021.72 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159750 | 0.4 | 2510.75 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159575 | 0.8 | 1255.91 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160034 | 1 | 1005.31 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159930 | 2 | 669.24 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159619 | 4 | 659.69 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159797 | 8 | 650.24 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159810 | 10 | 648.73 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1 | 381/122 | 1 | 1.49 |
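For reference, one common way to consume streaming responses from the Triton vLLM backend is the gRPC streaming API; a minimal sketch of a single such request is below. The tensor names ("text_input", "stream", "text_output") follow the examples in the triton-inference-server/vllm_backend repository; the model name "vllm_model", the server address, and the prompt are assumptions, not the exact client used in this test.

```python
# Minimal sketch of one streaming request to the Triton vLLM backend over gRPC.
# Tensor names ("text_input", "stream", "text_output") follow the vllm_backend
# examples; the model name "vllm_model" and the server address are assumptions.
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def on_response(chunks, result, error):
    # Invoked once per streamed response; collect partial outputs (or errors).
    if error is not None:
        chunks.append(error)
    else:
        chunks.append(result.as_numpy("text_output"))


def stream_one(prompt: str, model_name: str = "vllm_model") -> list:
    chunks: list = []
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    text = grpcclient.InferInput("text_input", [1], "BYTES")
    text.set_data_from_numpy(np.array([prompt.encode()], dtype=np.object_))
    stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
    stream_flag.set_data_from_numpy(np.array([True]))

    client.start_stream(callback=partial(on_response, chunks))
    client.async_stream_infer(model_name=model_name,
                              inputs=[text, stream_flag],
                              request_id="0")
    client.stop_stream()  # waits for the remaining streamed responses
    client.close()
    return chunks


if __name__ == "__main__":
    print(f"received {len(stream_one('Describe the image.'))} streamed chunks")
```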
All tests use streaming responses, and the benchmark duration is the total time needed to complete all requests. At QPS 1 and below the two deployments take essentially the same time, but as QPS increases the Triton server takes significantly longer than vLLM serve: from QPS 2 upward its duration flattens out around 650–670 s, an effective ceiling of roughly 1.5 completed requests per second, while vLLM serve keeps tracking the offered rate. This raises the question of whether the additional latency is caused by extra overhead in the Triton server's vLLM backend when packaging and forwarding streamed responses.
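A quick sanity check on those numbers (a sketch; the figures are copied from the tables above): with a fixed-rate, open-loop load generator the total duration cannot be much less than total_requests / QPS. The vLLM serve durations sit on that floor at every tested rate, while the Triton durations stop following it above QPS 1.

```python
# Open-loop lower bound (total_requests / QPS) vs. the measured durations
# copied from the two tables above.
total_requests = 1000
measured = {  # qps: (vllm_serve_seconds, triton_seconds)
    0.1: (10044.81, 10043.13), 0.2: (5021.97, 5021.72), 0.4: (2511.06, 2510.75),
    0.8: (1255.53, 1255.91),   1: (1004.49, 1005.31),   2: (502.66, 669.24),
    4: (252.25, 659.69),       8: (127.35, 650.24),     10: (102.37, 648.73),
}

for qps, (vllm_s, triton_s) in measured.items():
    floor = total_requests / qps
    print(f"qps={qps:<4} floor={floor:8.1f}s  vllm={vllm_s:8.1f}s  triton={triton_s:8.1f}s")

# From QPS 2 upward Triton flattens out near 650 s, i.e. about
# 1000 / 650 ≈ 1.5 completed requests per second, while vLLM serve
# stays on the open-loop floor.
```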