
[Question] Different performance of vLLM server and Triton Inference Server #8375

@AnyangAngus

Description

I deployed the Qwen2.5-VL-7B-Instruct model with both `vllm serve` and the Triton Inference Server vLLM backend. I used the same model, GPU resources, dataset, and stress-test parameters, but got different performance results.

vLLM engine server log: `[core.py:61] Initializing a V1 LLM engine (v0.9.0.pre1+1958ee56.nv25.06)`
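For context, here is a minimal sketch of how a single streaming request can be issued against each deployment. The ports, the Triton model name `vllm_model`, and the exact request fields are assumptions based on vLLM's OpenAI-compatible API and Triton's generate extension, not details taken from this issue; adjust them to your setup.

```python
# Sketch only: one streaming request per server, assuming vLLM's OpenAI-compatible
# server on :8000 and Triton's HTTP endpoint on :8001 with a model named "vllm_model".
import json
import requests

def stream_vllm(prompt: str) -> str:
    """Stream a chat completion from the vLLM OpenAI-compatible server."""
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
            "max_tokens": 256,
        },
        stream=True,
    )
    out = []
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        out.append(chunk["choices"][0]["delta"].get("content") or "")
    return "".join(out)

def stream_triton(prompt: str) -> str:
    """Stream from Triton's generate_stream endpoint backed by the vLLM backend."""
    resp = requests.post(
        "http://localhost:8001/v2/models/vllm_model/generate_stream",
        json={
            "text_input": prompt,
            "stream": True,
            # sampling_parameters is passed as a JSON string in the vLLM backend
            "sampling_parameters": json.dumps({"max_tokens": 256}),
        },
        stream=True,
    )
    out = []
    for line in resp.iter_lines():
        if line.startswith(b"data: "):
            out.append(json.loads(line[len(b"data: "):]).get("text_output", ""))
    return "".join(out)
```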

vLLM:

| Engine | GPU | Model Name | Dataset | Total requests | Total Input/Output Length | QPS | Duration (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159397 | 0.1 | 10044.81 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159250 | 0.2 | 5021.97 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158921 | 0.4 | 2511.06 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158649 | 0.8 | 1255.53 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159001 | 1 | 1004.49 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159208 | 2 | 502.66 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160258 | 4 | 252.25 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160102 | 8 | 127.35 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/158758 | 10 | 102.37 |
| vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1 | 381/126 | 1 | 0.9 |

Triton Inference Server:

| Server | Engine | GPU | Model Name | Dataset | Total requests | Total Input/Output Length | QPS | Duration (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159927 | 0.1 | 10043.13 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159866 | 0.2 | 5021.72 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159750 | 0.4 | 2510.75 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159575 | 0.8 | 1255.91 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/160034 | 1 | 1005.31 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159930 | 2 | 669.24 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159619 | 4 | 659.69 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159797 | 8 | 650.24 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1000 | 386219/159810 | 10 | 648.73 |
| Triton | vLLM V1 | 1*H100 | Qwen2.5-VL-7B-Instruct | ShareGPT4V | 1 | 381/122 | 1 | 1.49 |

All tests use streaming responses, and the benchmark duration is the total wall-clock time needed to complete all requests. As QPS increases, the Triton server takes significantly longer than vLLM serve (for example, 650.24 s vs. 127.35 s at 8 QPS). Is the additional latency caused by extra overhead in the Triton vLLM backend when it packages streaming responses?
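For clarity, here is a rough sketch of the load pattern described above: an open-loop client that starts one streaming request every 1/QPS seconds and reports the wall-clock time until the last response finishes. `send_streaming_request` is a placeholder, not an actual vLLM or Triton API; either client from the earlier sketch could be plugged in.

```python
# Open-loop fixed-QPS load generator (sketch). The reported duration is the
# total time from the first request being issued to the last stream completing.
import asyncio
import time

async def send_streaming_request(prompt: str) -> None:
    # Placeholder: issue one streaming request and consume all chunks.
    ...

async def run_benchmark(prompts: list[str], qps: float) -> float:
    tasks = []
    start = time.perf_counter()
    for prompt in prompts:
        tasks.append(asyncio.create_task(send_streaming_request(prompt)))
        await asyncio.sleep(1.0 / qps)  # keep the arrival rate fixed regardless of latency
    await asyncio.gather(*tasks)        # duration includes draining in-flight requests
    return time.perf_counter() - start

# Example: 1000 prompts at 8 QPS arrive in ~125 s, so a total duration far above
# that (as in the Triton rows) means requests are queuing behind the server
# rather than completing at the arrival rate.
# duration = asyncio.run(run_benchmark(prompts, qps=8))
```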
