Description
Hi, I ran into a problem: when I send requests to the model, it starts reserving memory on the GPU. That by itself is expected, but after I stop sending requests the memory is not released, and when I start sending requests again the memory keeps growing and is never freed. I use locust to send requests to the model, but the same thing happens with perf_analyzer, following this example: https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/perf_analyzer
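The requests I send look roughly like this (a minimal sketch using tritonclient.http; sample.jpg and the fixed loop count are placeholders, not my actual locust setup):

import numpy as np
import tritonclient.http as httpclient

# Minimal sketch of the traffic that locust / perf_analyzer generates:
# read an encoded JPEG and send its raw bytes to DALI_INPUT.
client = httpclient.InferenceServerClient(url="localhost:8000")

with open("sample.jpg", "rb") as f:              # placeholder image file
    encoded = np.frombuffer(f.read(), dtype=np.uint8)
data = np.expand_dims(encoded, axis=0)           # add the batch dimension -> shape (1, N)

inp = httpclient.InferInput("DALI_INPUT", list(data.shape), "UINT8")
inp.set_data_from_numpy(data)
out = httpclient.InferRequestedOutput("DALI_OUTPUT")

for _ in range(1000):                            # keep the server busy while watching GPU memory
    result = client.infer(model_name="dali", inputs=[inp], outputs=[out])
    _ = result.as_numpy("DALI_OUTPUT")           # (1, 3, 512, 512) float32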
dali.py:
import nvidia.dali as dali
import nvidia.dali.types as types
from nvidia.dali.plugin.triton import autoserialize
IMAGE_SIZE = (512, 512)
MEAN = dali.fn.constant(fdata=[0.485, 0.456, 0.406], shape=(1, 1, 3)) * 255
STD = dali.fn.constant(fdata=[0.229, 0.224, 0.225], shape=(1, 1, 3)) * 255
@autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=IMAGE_SIZE[0], resize_y=IMAGE_SIZE[1], interp_type=types.INTERP_LINEAR)
    images = dali.fn.normalize(images, scale=1, mean=MEAN, stddev=STD, shift=0)
    images = dali.fn.transpose(images, perm=[2, 0, 1])
    images = dali.fn.cast(images, dtype=types.FLOAT)
    return images
config.pbtxt:
name: "dali"
backend: "dali"
max_batch_size: 8
dynamic_batching {
preferred_batch_size: [ 2, 4, 8 ]
max_queue_delay_microseconds: 1000
}
input [
{
name: "DALI_INPUT"
data_type: TYPE_UINT8
dims: [ -1 ]
allow_ragged_batch: true
}
]
output [
{
name: "DALI_OUTPUT"
data_type: TYPE_FP32
dims: [ 3, 512, 512 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
optimization {
graph : {
level : 1
}
}
model repo:
models/
├── dali
│ ├── 1
│ │ └── dali.py
│ └── config.pbtxt
I use Triton release 25.02 (RELEASE=25.02).
The commands to build and run Triton:
docker build -t triton:dev -f ./docker/Dockerfile .
docker run --rm -it --init --shm-size=1GB -p 8000:8000 -p 8001:8001 -p 8002:8002 triton:dev
...
tritonserver \
--model-repository=/models \
--allow-metrics=1 \
--log-verbose=0 \
--metrics-config summary_latencies=true \
--pinned-memory-pool-byte-size=1073741824 \
--cuda-memory-pool-byte-size=0:1073741824
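The GPU memory growth can also be seen through Triton's metrics endpoint (enabled above with --allow-metrics=1 and published on port 8002). A rough sketch that polls it, assuming the default nv_gpu_memory_used_bytes GPU metric is reported:

import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"   # metrics port from the docker run command above

while True:
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for line in body.splitlines():
        # lines look like: nv_gpu_memory_used_bytes{gpu_uuid="GPU-..."} 1234567890
        if line.startswith("nv_gpu_memory_used_bytes"):
            used_bytes = float(line.rsplit(" ", 1)[-1])
            print(f"{time.strftime('%H:%M:%S')}  GPU memory used: {used_bytes / 2**20:.0f} MiB")
    time.sleep(5)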
I noticed that if I change device="mixed" to device="cpu" in dali.fn.decoders.image(images, device="mixed", output_type=types.RGB), the problem disappears, but at the same time the model's RPS drops.
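For clarity, this is the CPU-decoder variant I mean (same imports and constants as dali.py above, with only the decoder line changed; as far as I understand, with a CPU decoder the downstream operators also stay on the CPU unless the data is explicitly moved with .gpu()):

@autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT")
    # "cpu" decoder instead of "mixed": GPU memory stays flat, but RPS drops
    images = dali.fn.decoders.image(images, device="cpu", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=IMAGE_SIZE[0], resize_y=IMAGE_SIZE[1], interp_type=types.INTERP_LINEAR)
    images = dali.fn.normalize(images, scale=1, mean=MEAN, stddev=STD, shift=0)
    images = dali.fn.transpose(images, perm=[2, 0, 1])
    images = dali.fn.cast(images, dtype=types.FLOAT)
    return images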