
Why does the DALI model accumulate GPU memory when sending requests? #265

@gjegoj

Description


Hi, I've run into a problem: when I send requests to the model, it starts reserving memory on the GPU. That in itself is fine, but after I stop sending requests the memory is not released, and when I start sending requests again, even more memory gets reserved and is never freed. I use locust to send the requests, but the same thing happens with perf_analyzer; I followed this example: https://github.com/triton-inference-server/dali_backend/tree/main/docs/examples/perf_analyzer
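For reference, a plain Python client loop exercises the model the same way (a minimal sketch using tritonclient; the image file, batch shape, and server URL are assumptions, not part of the original setup):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One encoded JPEG as a 1-D uint8 tensor, sent as a batch of one.
raw = np.fromfile("test.jpg", dtype=np.uint8)
inp = httpclient.InferInput("DALI_INPUT", [1, raw.shape[0]], "UINT8")
inp.set_data_from_numpy(np.expand_dims(raw, 0))
out = httpclient.InferRequestedOutput("DALI_OUTPUT")

for _ in range(20_000):
    client.infer("dali", inputs=[inp], outputs=[out])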

dali.py:

import nvidia.dali as dali
import nvidia.dali.types as types
from nvidia.dali.plugin.triton import autoserialize

IMAGE_SIZE = (512, 512)
MEAN = dali.fn.constant(fdata=[0.485, 0.456, 0.406], shape=(1, 1, 3)) * 255
STD = dali.fn.constant(fdata=[0.229, 0.224, 0.225], shape=(1, 1, 3)) * 255


@autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    # Encoded image bytes arrive from Triton as a 1-D uint8 tensor.
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT")
    # "mixed" = hybrid CPU/GPU (nvJPEG) decoding; the output lives on the GPU.
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=IMAGE_SIZE[0], resize_y=IMAGE_SIZE[1], interp_type=types.INTERP_LINEAR)
    images = dali.fn.normalize(images, scale=1, mean=MEAN, stddev=STD, shift=0)
    # HWC -> CHW for the downstream model.
    images = dali.fn.transpose(images, perm=[2, 0, 1])
    images = dali.fn.cast(images, dtype=types.FLOAT)

    return images
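The same pipeline can be exercised outside Triton to check whether the growth comes from DALI itself rather than from the backend (a sketch, assuming @autoserialize is removed for the local run and test.jpg is some encoded image; ReleaseUnusedMemory is DALI's hook for returning pooled memory and is worth trying here):

import numpy as np
import nvidia.dali.backend

from dali import pipe  # the pipeline above, with @autoserialize removed

p = pipe()
p.build()

raw = np.fromfile("test.jpg", dtype=np.uint8)
batch = [raw] * 8  # one full batch of encoded images

for _ in range(1000):
    p.feed_input("DALI_INPUT", batch)
    p.run()

# DALI pools allocations instead of freeing them eagerly; this asks it
# to give unused blocks back to the driver.
nvidia.dali.backend.ReleaseUnusedMemory()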

config.pbtxt:

name: "dali"
backend: "dali"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 2, 4, 8 ]
  max_queue_delay_microseconds: 1000
}
input [
    {
        name: "DALI_INPUT"
        data_type: TYPE_UINT8
        dims: [ -1 ]
        allow_ragged_batch: true
    }
]
output [
    {
        name: "DALI_OUTPUT"
        data_type: TYPE_FP32
        dims: [ 3, 512, 512 ]
    }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
optimization {
  graph : {
    level : 1
  }
}

model repo:

models/
├── dali
│   ├── 1
│   │   └── dali.py
│   └── config.pbtxt

I use Triton RELEASE=25.02
The command to run Triton:

docker build -t triton:dev -f ./docker/Dockerfile .
docker run --rm -it --init --shm-size=1GB -p 8000:8000 -p 8001:8001 -p 8002:8002 triton:dev
...
tritonserver \
    --model-repository=/models \
    --allow-metrics=1 \
    --log-verbose=0 \
    --metrics-config summary_latencies=true \
    --pinned-memory-pool-byte-size=1073741824 \
    --cuda-memory-pool-byte-size=0:1073741824
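Since --allow-metrics=1 is on, the growth can also be tracked through Triton's Prometheus endpoint on port 8002 instead of screenshots (a small sketch; nv_gpu_memory_used_bytes is Triton's standard GPU memory gauge):

import time
import urllib.request

# Poll Triton's metrics endpoint and print the GPU memory gauge.
while True:
    with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("nv_gpu_memory_used_bytes"):
                print(line)
    time.sleep(5)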

Before sending requests: [screenshot: GPU memory usage]

After sending 20k requests: [screenshot: GPU memory usage]

After sending another 20k requests: [screenshot: GPU memory usage]

I noticed that when I change the parameter device="mixed" to device="cpu" in dali.fn.decoders.image(images, device="mixed", output_type=types.RGB), the problem disappears, but at the same time the RPS of the model drops. The decoder change is shown below.
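That is, the workaround is the CPU-only decoder:

    # CPU decode: no memory growth observed, but lower RPS.
    images = dali.fn.decoders.image(images, device="cpu", output_type=types.RGB)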
