`docs/contributing/profiling.md`: 19 additions & 27 deletions
````diff
@@ -39,15 +39,15 @@ Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline_inference/simple_profiling.py)
 
 ```bash
 VLLM_TORCH_PROFILER_DIR=./vllm_profile \
-vllm serve meta-llama/Meta-Llama-3-70B
+vllm serve meta-llama/Llama-3.1-8B-Instruct
 ```
 
 vllm bench command:
 
 ```bash
 vllm bench serve \
 --backend vllm \
---model meta-llama/Meta-Llama-3-70B \
+--model meta-llama/Llama-3.1-8B-Instruct \
 --dataset-name sharegpt \
 --dataset-path sharegpt.json \
 --profile \
````
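Not part of the diff above, but for context: with `VLLM_TORCH_PROFILER_DIR` set as in the first hunk, the Torch profiler traces are written into that directory once the run finishes. A minimal sketch of how one might inspect them, assuming the traces are the usual gzipped Chrome/TensorBoard-style traces (paths and the TensorBoard plugin are illustrative assumptions, not taken from this PR):

```bash
# Traces land under the directory given by VLLM_TORCH_PROFILER_DIR.
ls ./vllm_profile

# Option 1: load a *.json.gz trace into https://ui.perfetto.dev/ in the browser.
# Option 2: view the traces in TensorBoard via the torch-tb-profiler plugin
# (assumes the traces use the standard TensorBoard trace-handler layout).
pip install torch-tb-profiler
tensorboard --logdir ./vllm_profile
```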
````diff
@@ -70,18 +70,21 @@ apt update
 apt install nsight-systems-cli
 ```
 
-### Example commands and usage
+!!! tip
+    When profiling with `nsys`, it is advisable to set the environment variable `VLLM_WORKER_MULTIPROC_METHOD=spawn`. The default is to use the `fork` method instead of `spawn`. More information on the topic can be found in the [Nsight Systems release notes](https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html#general-issues).
 
-When profiling with `nsys`, it is advisable to set the environment variable `VLLM_WORKER_MULTIPROC_METHOD=spawn`. The default is to use the `fork` method instead of `spawn`. More information on the topic can be found in the [Nsight Systems release notes](https://docs.nvidia.com/nsight-systems/ReleaseNotes/index.html#general-issues).
+The Nsight Systems profiler can be launched with `nsys profile ...`, with a few recommended flags for vLLM: `--trace-fork-before-exec=true --cuda-graph-trace=node`.
+
+### Example commands and usage
 
 #### Offline Inference
 
-For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` before any existing script you would run for offline inference.
+For basic usage, you can just append the profiling command before any existing script you would run for offline inference.
 
 The following is an example using the `vllm bench latency` script:
 
 ```bash
-nsys profile -o report.nsys-rep \
+nsys profile \
 --trace-fork-before-exec=true \
 --cuda-graph-trace=node \
 vllm bench latency \
````
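The hunk cuts off before the full `vllm bench latency` invocation. Putting the spawn tip and the recommended flags together, a hedged end-to-end sketch could look like the following (the model name and benchmark arguments are placeholders, not taken from the diff):

```bash
# Combine the VLLM_WORKER_MULTIPROC_METHOD=spawn tip with the recommended
# nsys flags for a single latency run.
VLLM_WORKER_MULTIPROC_METHOD=spawn \
nsys profile \
    --trace-fork-before-exec=true \
    --cuda-graph-trace=node \
    vllm bench latency \
    --model meta-llama/Llama-3.1-8B-Instruct
```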
````diff
@@ -95,40 +98,29 @@ vllm bench latency \
 
 #### OpenAI Server
 
-To profile the server, you will want to prepend your `vllm serve` command with `nsys profile` just like for offline inference, however you must specify `--delay XX --duration YY` parameters according to the needs of your benchmark. After the duration time has been used up, the server will be killed.
+To profile the server, you will want to prepend your `vllm serve` command with `nsys profile` just like for offline inference, but you will need to specify a few other arguments to enable dynamic capture, similar to the Torch Profiler:
 
 ```bash
 # server
-nsys profile -o report.nsys-rep \
+VLLM_TORCH_CUDA_PROFILE=1 \
+nsys profile \
 --trace-fork-before-exec=true \
 --cuda-graph-trace=node \
---delay 30 \
---duration 60 \
+--capture-range=cudaProfilerApi \
+--capture-range-end repeat \
 vllm serve meta-llama/Llama-3.1-8B-Instruct
 
 # client
 vllm bench serve \
 --backend vllm \
 --model meta-llama/Llama-3.1-8B-Instruct \
---num-prompts 1 \
---dataset-name random \
---random-input 1024 \
---random-output 512
-```
-
-In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
-
-```bash
-nsys sessions list
-```
-
-to get the session id in the form of `profile-XXXXX`, then run:
-
-```bash
-nsys stop --session=profile-XXXXX
+--dataset-name sharegpt \
+--dataset-path sharegpt.json \
+--profile \
+--num-prompts 2
 ```
 
-to manually kill the profiler and generate your `nsys-rep` report.
+With `--profile`, vLLM will capture a profile for each run of `vllm bench serve`. Once the server is killed, the profiles will all be saved.
````
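A brief follow-up, not part of the PR: once the server is shut down and the `.nsys-rep` file(s) have been written, they can be summarized on the command line or opened in the GUI. The report name below is illustrative; actual file names depend on the capture settings.

```bash
# Print text summaries (kernel, NVTX, OS runtime stats) from a generated report.
nsys stats report1.nsys-rep

# Or open the report in the Nsight Systems GUI for the full timeline view.
nsys-ui report1.nsys-rep
```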