I deployed the Qwen2.5-VL-7B-Instruct model with both `vllm serve` and the Triton Inference Server vLLM backend. Using the same model, GPU resources, dataset, and stress-test parameters, I got noticeably different performance results.
vLLM engine log: `[core.py:61] Initializing a V1 LLM engine (v0.9.0.pre1+1958ee56.nv25.06)`
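Judging by the durations (roughly total_requests / QPS at low rates), the load generator is open loop: requests are issued at a fixed QPS regardless of whether earlier requests have finished, and the duration is the time until the last streamed response completes. Below is a minimal sketch of such a client against the vLLM serve OpenAI-compatible endpoint; the URL, port, model path, `max_tokens`, and the fixed text-only prompt are assumptions, and the real benchmark replays ShareGPT4V samples (including images).

```python
# Sketch of a fixed-rate (open-loop) streaming load generator against the
# vLLM serve OpenAI-compatible endpoint. URL, model path, max_tokens, and the
# text-only prompt are assumptions; the real test replays ShareGPT4V samples.
import asyncio
import time

import httpx

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"


async def one_request(client: httpx.AsyncClient, prompt: str) -> None:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,          # all tests use streaming responses
        "max_tokens": 256,
    }
    async with client.stream("POST", VLLM_URL, json=payload) as resp:
        async for _ in resp.aiter_lines():
            pass                 # consume SSE chunks; record per-token times here


async def run(qps: float, total_requests: int) -> float:
    """Issue requests at a fixed rate and return the time to finish them all."""
    async with httpx.AsyncClient(timeout=None) as client:
        start = time.perf_counter()
        tasks = []
        for i in range(total_requests):
            tasks.append(asyncio.create_task(one_request(client, f"prompt {i}")))
            await asyncio.sleep(1.0 / qps)   # open loop: pace by QPS, not by completion
        await asyncio.gather(*tasks)
        return time.perf_counter() - start


if __name__ == "__main__":
    print(f"duration: {asyncio.run(run(qps=1.0, total_requests=1000)):.2f}s")
```

Measured results for both deployments follow.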
vLLM serve:
Engine | GPU | Model Name | Dataset | Total requests | Total Input/Output Length | QPS | Duration(s) |
---|---|---|---|---|---|---|---|
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159397 | 0.1 | 10044.81 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159250 | 0.2 | 5021.97 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158921 | 0.4 | 2511.06 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158649 | 0.8 | 1255.53 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159001 | 1 | 1004.49 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159208 | 2 | 502.66 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160258 | 4 | 252.25 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160102 | 8 | 127.35 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158758 | 10 | 102.37 |
vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1 | 381/126 | 1 | 0.9 |
Triton Inference Server (vLLM backend):
Server | Engine | GPU | Model Name | Dataset | Total requests | Total Input/Output Length | QPS | Duration(s) |
---|---|---|---|---|---|---|---|---|
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159927 | 0.1 | 10043.13 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159866 | 0.2 | 5021.72 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159750 | 0.4 | 2510.75 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159575 | 0.8 | 1255.91 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160034 | 1 | 1005.31 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159930 | 2 | 669.24 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159619 | 4 | 659.69 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159797 | 8 | 650.24 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159810 | 10 | 648.73 |
Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1 | 381/122 | 1 | 1.49 |
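For reference, one common way to consume streaming responses from the Triton vLLM backend is the gRPC streaming API; a minimal sketch of a single such request is below. The tensor names ("text_input", "stream", "text_output") follow the examples in the triton-inference-server/vllm_backend repository; the model name "vllm_model", the server address, and the prompt are assumptions, not the exact client used in this test.

```python
# Minimal sketch of one streaming request to the Triton vLLM backend over gRPC.
# Tensor names ("text_input", "stream", "text_output") follow the vllm_backend
# examples; the model name "vllm_model" and the server address are assumptions.
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient


def on_response(chunks, result, error):
    # Invoked once per streamed response; collect partial outputs (or errors).
    if error is not None:
        chunks.append(error)
    else:
        chunks.append(result.as_numpy("text_output"))


def stream_one(prompt: str, model_name: str = "vllm_model") -> list:
    chunks: list = []
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    text = grpcclient.InferInput("text_input", [1], "BYTES")
    text.set_data_from_numpy(np.array([prompt.encode()], dtype=np.object_))
    stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
    stream_flag.set_data_from_numpy(np.array([True]))

    client.start_stream(callback=partial(on_response, chunks))
    client.async_stream_infer(model_name=model_name,
                              inputs=[text, stream_flag],
                              request_id="0")
    client.stop_stream()  # waits for the remaining streamed responses
    client.close()
    return chunks


if __name__ == "__main__":
    print(f"received {len(stream_one('Describe the image.'))} streamed chunks")
```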
All tests use streaming responses, and the benchmark duration is the total time needed to complete all requests. At QPS 1 and below the two deployments take essentially the same time, but as QPS increases the Triton server takes significantly longer than vLLM serve: from QPS 2 upward its duration flattens out around 650–670 s, an effective ceiling of roughly 1.5 completed requests per second, while vLLM serve keeps tracking the offered rate. This raises the question of whether the additional latency is caused by extra overhead in the Triton server's vLLM backend when packaging and forwarding streamed responses.
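A quick sanity check on those numbers (a sketch; the figures are copied from the tables above): with a fixed-rate, open-loop load generator the total duration cannot be much less than total_requests / QPS. The vLLM serve durations sit on that floor at every tested rate, while the Triton durations stop following it above QPS 1.

```python
# Open-loop lower bound (total_requests / QPS) vs. the measured durations
# copied from the two tables above.
total_requests = 1000
measured = {  # qps: (vllm_serve_seconds, triton_seconds)
    0.1: (10044.81, 10043.13), 0.2: (5021.97, 5021.72), 0.4: (2511.06, 2510.75),
    0.8: (1255.53, 1255.91),   1: (1004.49, 1005.31),   2: (502.66, 669.24),
    4: (252.25, 659.69),       8: (127.35, 650.24),     10: (102.37, 648.73),
}

for qps, (vllm_s, triton_s) in measured.items():
    floor = total_requests / qps
    print(f"qps={qps:<4} floor={floor:8.1f}s  vllm={vllm_s:8.1f}s  triton={triton_s:8.1f}s")

# From QPS 2 upward Triton flattens out near 650 s, i.e. about
# 1000 / 650 ≈ 1.5 completed requests per second, while vLLM serve
# stays on the open-loop floor.
```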