Description
Hello!
When using sherpa-onnx-online-websocket-server with the CUDA provider, GPU VRAM usage keeps increasing as the server handles WebSocket connections.
Even after a client finishes streaming (the Done message is sent, the final result is returned, and the connection is closed), the VRAM allocated for that stream is not released. Over time this leads to out-of-memory (OOM) errors and forces the server process to crash or restart.
This makes it impossible to run the server under heavy load (hundreds of concurrent streams), since VRAM usage grows roughly linearly with the number of completed connections.
The server is started with:

```
./bin/sherpa-onnx-online-websocket-server \
  --port=8080 \
  --num-work-threads=16 \
  --num-io-threads=8 \
  --tokens=./models/tokens.txt \
  --encoder=./models/encoder.onnx \
  --decoder=./models/decoder.onnx \
  --joiner=./models/joiner.onnx \
  --provider=cuda \
  --max-batch-size=128 \
  --loop-interval-ms=10
```
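To track the growth, one can sample used VRAM between batches of connections. A minimal standalone sampler using NVML (my own helper, not part of sherpa-onnx; assumes the NVML development headers are installed):

```cpp
// vram-sampler.cc: print used VRAM on GPU 0 once per second.
// Build: g++ vram-sampler.cc -o vram-sampler -lnvidia-ml
#include <nvml.h>

#include <chrono>
#include <cstdio>
#include <thread>

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) return 1;

  nvmlDevice_t dev;
  if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) {
    nvmlShutdown();
    return 1;
  }

  for (int i = 0; i < 600; ++i) {  // sample once per second for 10 minutes
    nvmlMemory_t mem;
    if (nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
      std::printf("used: %llu MiB\n",
                  static_cast<unsigned long long>(mem.used >> 20));
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }

  nvmlShutdown();
  return 0;
}
```

With this running alongside the server, used memory steps up with each batch of completed connections and never returns to the post-load baseline.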
Actual behavior

- VRAM usage keeps increasing after each client disconnects.
- Even though connections are removed from `connections_` in `OnlineWebsocketDecoder::ProcessConnections`, the GPU memory is not freed.
- Eventually the server hits OOM and restarts.
Environment

- sherpa-onnx version: v1.12.13 (latest release at the time of writing)
- Build type: prebuilt GPU binaries from https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.13/sherpa-onnx-v1.12.13-cuda-12.x-cudnn-9.x-linux-x64-gpu.tar.bz2
- CUDA version: 12.8
- GPU: NVIDIA H100 80GB
- OS: Docker image nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04
The issue seems related to `OnlineRecognizer` / `OnlineStream` not freeing GPU state after `InputFinished()` is called and the connection is removed from `connections_`. I tried adding manual cleanup (resetting the stream, clearing connections), but VRAM still accumulates; a standalone sketch of the per-stream lifecycle is below.
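To take the WebSocket layer out of the picture, the same per-stream lifecycle can be driven directly against the C++ API. A minimal sketch based on my reading of `sherpa-onnx/csrc/online-recognizer.h` (config field names are approximate and may differ between versions):

```cpp
// leak-repro.cc: run many short-lived streams through a single recognizer,
// watching VRAM between iterations (e.g. with the NVML sampler above).
// NOTE: config field names follow my reading of the headers and may not
// match your sherpa-onnx version exactly.
#include <cstdio>
#include <vector>

#include "sherpa-onnx/csrc/online-recognizer.h"

int main() {
  sherpa_onnx::OnlineRecognizerConfig config;
  config.model_config.transducer.encoder = "./models/encoder.onnx";
  config.model_config.transducer.decoder = "./models/decoder.onnx";
  config.model_config.transducer.joiner = "./models/joiner.onnx";
  config.model_config.tokens = "./models/tokens.txt";
  config.model_config.provider = "cuda";  // may live under provider_config in newer headers

  sherpa_onnx::OnlineRecognizer recognizer(config);

  std::vector<float> samples(16000, 0.0f);  // 1 second of silence @ 16 kHz

  for (int i = 0; i < 1000; ++i) {
    auto s = recognizer.CreateStream();
    s->AcceptWaveform(16000, samples.data(),
                      static_cast<int32_t>(samples.size()));
    s->InputFinished();
    while (recognizer.IsReady(s.get())) {
      sherpa_onnx::OnlineStream *ss[1] = {s.get()};
      recognizer.DecodeStreams(ss, 1);
    }
    // The stream is destroyed at the end of this scope. Per the behavior
    // described above, VRAM is not returned here on the CUDA provider.
    if (i % 100 == 0) std::printf("completed %d streams\n", i);
  }
  return 0;
}
```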
With the CPU provider, memory is released correctly after each stream completes. With the CUDA provider, VRAM grows continuously with each completed stream.
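One possible direction: onnxruntime's CUDA execution provider allocates from a BFC memory arena that by default grows and does not shrink, so even freed tensors keep their VRAM reserved. If sherpa-onnx were patched to pass CUDA provider options into session creation, capping the arena would at least bound the growth. A hedged sketch using the stock onnxruntime C++ API (not an existing sherpa-onnx option as far as I know; the 8 GiB limit is an arbitrary example):

```cpp
// Sketch: cap the CUDA EP memory arena when building session options.
// sherpa-onnx would need a patch to route these options into the sessions
// it creates for the encoder/decoder/joiner.
#include "onnxruntime_cxx_api.h"

Ort::SessionOptions MakeCudaSessionOptions() {
  OrtCUDAProviderOptions cuda_options{};
  cuda_options.device_id = 0;
  // 1 = kSameAsRequested: grow the arena only by what is actually needed,
  // instead of doubling on each extension (0 = kNextPowerOfTwo, the default).
  cuda_options.arena_extend_strategy = 1;
  // Hard cap on the arena size in bytes (example value).
  cuda_options.gpu_mem_limit = 8ULL * 1024 * 1024 * 1024;

  Ort::SessionOptions opts;
  opts.AppendExecutionProvider_CUDA(cuda_options);
  return opts;
}
```

Even if such a cap bounds the growth, the underlying question remains why per-stream state is not reused from the arena across connections, as it appears to be on CPU.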