Description
Hi, I ran into a problem: when I send requests to the model, it starts reserving memory on the GPU. That by itself is expected, but after I stop sending requests the memory is not released, and when I start sending requests again the memory keeps growing and is never freed. I use locust to send requests to the model, but the same thing happens with perf_analyzer, following this example: https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/perf_analyzer
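The requests I send look roughly like this (a minimal sketch using tritonclient.http; sample.jpg and the fixed loop count are placeholders, not my actual locust setup):

import numpy as np
import tritonclient.http as httpclient

# Minimal sketch of the traffic that locust / perf_analyzer generates:
# read an encoded JPEG and send its raw bytes to DALI_INPUT.
client = httpclient.InferenceServerClient(url="localhost:8000")

with open("sample.jpg", "rb") as f:              # placeholder image file
    encoded = np.frombuffer(f.read(), dtype=np.uint8)
data = np.expand_dims(encoded, axis=0)           # add the batch dimension -> shape (1, N)

inp = httpclient.InferInput("DALI_INPUT", list(data.shape), "UINT8")
inp.set_data_from_numpy(data)
out = httpclient.InferRequestedOutput("DALI_OUTPUT")

for _ in range(1000):                            # keep the server busy while watching GPU memory
    result = client.infer(model_name="dali", inputs=[inp], outputs=[out])
    _ = result.as_numpy("DALI_OUTPUT")           # (1, 3, 512, 512) float32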
dali.py:
import nvidia.dali as dali
import nvidia.dali.types as types
from nvidia.dali.plugin.triton import autoserialize
IMAGE_SIZE = (512, 512)
MEAN = dali.fn.constant(fdata=[0.485, 0.456, 0.406], shape=(1, 1, 3)) * 255
STD = dali.fn.constant(fdata=[0.229, 0.224, 0.225], shape=(1, 1, 3)) * 255
@autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=IMAGE_SIZE[0], resize_y=IMAGE_SIZE[1], interp_type=types.INTERP_LINEAR)
    images = dali.fn.normalize(images, scale=1, mean=MEAN, stddev=STD, shift=0)
    images = dali.fn.transpose(images, perm=[2, 0, 1])
    images = dali.fn.cast(images, dtype=types.FLOAT)
    return images
config.pbtxt:
name: "dali"
backend: "dali"
max_batch_size: 8
dynamic_batching {
preferred_batch_size: [ 2, 4, 8 ]
max_queue_delay_microseconds: 1000
}
input [
{
name: "DALI_INPUT"
data_type: TYPE_UINT8
dims: [ -1 ]
allow_ragged_batch: true
}
]
output [
{
name: "DALI_OUTPUT"
data_type: TYPE_FP32
dims: [ 3, 512, 512 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
optimization {
graph : {
level : 1
}
}
model repo:
models/
├── dali
│ ├── 1
│ │ └── dali.py
│ └── config.pbtxt
I use Triton release 25.02 (RELEASE=25.02).
The commands to build and run Triton:
docker build -t triton:dev -f ./docker/Dockerfile .
docker run --rm -it --init --shm-size=1GB -p 8000:8000 -p 8001:8001 -p 8002:8002 triton:dev
...
tritonserver \
--model-repository=/models \
--allow-metrics=1 \
--log-verbose=0 \
--metrics-config summary_latencies=true \
--pinned-memory-pool-byte-size=1073741824 \
--cuda-memory-pool-byte-size=0:1073741824
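The GPU memory growth can also be seen through Triton's metrics endpoint (enabled above with --allow-metrics=1 and published on port 8002). A rough sketch that polls it, assuming the default nv_gpu_memory_used_bytes GPU metric is reported:

import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"   # metrics port from the docker run command above

while True:
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for line in body.splitlines():
        # lines look like: nv_gpu_memory_used_bytes{gpu_uuid="GPU-..."} 1234567890
        if line.startswith("nv_gpu_memory_used_bytes"):
            used_bytes = float(line.rsplit(" ", 1)[-1])
            print(f"{time.strftime('%H:%M:%S')}  GPU memory used: {used_bytes / 2**20:.0f} MiB")
    time.sleep(5)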
I noticed that if I change device="mixed" to device="cpu" in dali.fn.decoders.image(images, device="mixed", output_type=types.RGB), the problem disappears, but at the same time the model's RPS drops.
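For clarity, this is the CPU-decoder variant I mean (same imports and constants as dali.py above, with only the decoder line changed; as far as I understand, with a CPU decoder the downstream operators also stay on the CPU unless the data is explicitly moved with .gpu()):

@autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT")
    # "cpu" decoder instead of "mixed": GPU memory stays flat, but RPS drops
    images = dali.fn.decoders.image(images, device="cpu", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=IMAGE_SIZE[0], resize_y=IMAGE_SIZE[1], interp_type=types.INTERP_LINEAR)
    images = dali.fn.normalize(images, scale=1, mean=MEAN, stddev=STD, shift=0)
    images = dali.fn.transpose(images, perm=[2, 0, 1])
    images = dali.fn.cast(images, dtype=types.FLOAT)
    return images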