The model inference time is inconsistent #8379

@wsd12345

Description

I use an ensemble with python_backend. After 10 warmup runs with random data, the first inference on a given piece of data takes especially long; only the second inference on the same data reaches the expected time.

Triton Information
What version of Triton are you using?
24.05-py3

Are you using the Triton container or did you build it yourself?
I installed the necessary Python libraries and then built it myself.

To Reproduce
Steps to reproduce the behavior:

  1. Start command: tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:1024000000
  2. Warmup: call inference(np.random.uniform(size=size)) 10 times.
  3. Run: I use 10 pieces of data and iterate over them once each in random order (see the sketch below): for da in np.random.choice(data, len(data), replace=False): inference(da)
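For concreteness, here is a minimal client-side sketch of steps 2 and 3. The endpoint (localhost:8000), model name (ensemble), tensor names (INPUT0/OUTPUT0), dtype (FP32), and input shape are all placeholder assumptions, not the real values from my repository. Note that np.random.choice requires a 1-D array, so the sketch shuffles indices rather than the arrays themselves:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder names and shape; the real model/tensor names are not shown here.
MODEL_NAME = "ensemble"
INPUT_NAME = "INPUT0"
OUTPUT_NAME = "OUTPUT0"
SHAPE = (1, 224)

client = httpclient.InferenceServerClient(url="localhost:8000")

def inference(arr):
    """Send one inference request and return the response."""
    inp = httpclient.InferInput(INPUT_NAME, list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr.astype(np.float32))
    out = httpclient.InferRequestedOutput(OUTPUT_NAME)
    return client.infer(MODEL_NAME, inputs=[inp], outputs=[out])

# Step 2: 10 warmup requests with random data.
for _ in range(10):
    inference(np.random.uniform(size=SHAPE))

# Step 3: 10 real samples, visited once each in random order.
# (np.random.choice needs a 1-D array, so shuffle indices, not arrays.)
data = [np.random.uniform(size=SHAPE) for _ in range(10)]  # stand-in for real data
for i in np.random.choice(len(data), len(data), replace=False):
    inference(data[i])
```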

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
A clear and concise description of what you expected to happen.
I found that when the Triton service is newly started, the first inference on each piece of data takes a long time. After that first pass, running the same 10 pieces of data again is fast for every one of them. For example, the first inference on a new piece of data takes about 20 seconds, while the second takes only about 0.1 s.
I am using a V100 GPU.
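To make the numbers concrete, timing two consecutive requests on the same sample (using the hypothetical inference helper from the sketch above) shows the gap:

```python
import time

sample = data[0]  # one of the 10 real pieces of data

t0 = time.perf_counter()
inference(sample)   # first request after server start: ~20 s observed
t1 = time.perf_counter()
inference(sample)   # repeat on the same data: ~0.1 s observed
t2 = time.perf_counter()

print(f"first:  {t1 - t0:.2f} s")
print(f"second: {t2 - t1:.2f} s")
```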
