The model inference time is inconsistent #8379

@wsd12345

Description

I use an ensemble with python_backend. After 10 warmup runs with random data, the first inference on a given piece of data takes especially long; only the second inference on the same data reaches the expected time.

Triton Information
What version of Triton are you using?
24.05-py3

Are you using the Triton container or did you build it yourself?
I installed the necessary Python libraries and then built it myself.

To Reproduce
Steps to reproduce the behavior:

  1. Start command: tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:1024000000
  2. Warmup: call inference(np.random.uniform(size=size)) 10 times.
  3. Run: I use 10 pieces of data and iterate over them once each in random order (see the sketch below): for da in np.random.choice(data, len(data), replace=False): inference(da)
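For concreteness, here is a minimal client-side sketch of steps 2 and 3. The endpoint (localhost:8000), model name (ensemble), tensor names (INPUT0/OUTPUT0), dtype (FP32), and input shape are all placeholder assumptions, not the real values from my repository. Note that np.random.choice requires a 1-D array, so the sketch shuffles indices rather than the arrays themselves:

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder names and shape; the real model/tensor names are not shown here.
MODEL_NAME = "ensemble"
INPUT_NAME = "INPUT0"
OUTPUT_NAME = "OUTPUT0"
SHAPE = (1, 224)

client = httpclient.InferenceServerClient(url="localhost:8000")

def inference(arr):
    """Send one inference request and return the response."""
    inp = httpclient.InferInput(INPUT_NAME, list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr.astype(np.float32))
    out = httpclient.InferRequestedOutput(OUTPUT_NAME)
    return client.infer(MODEL_NAME, inputs=[inp], outputs=[out])

# Step 2: 10 warmup requests with random data.
for _ in range(10):
    inference(np.random.uniform(size=SHAPE))

# Step 3: 10 real samples, visited once each in random order.
# (np.random.choice needs a 1-D array, so shuffle indices, not arrays.)
data = [np.random.uniform(size=SHAPE) for _ in range(10)]  # stand-in for real data
for i in np.random.choice(len(data), len(data), replace=False):
    inference(data[i])
```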

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
A clear and concise description of what you expected to happen.
I found that when the Triton service is newly started, the first inference on each piece of data takes a long time. After that first pass, running the same 10 pieces of data again is fast for every one of them. For example, the first inference on a new piece of data takes about 20 seconds, while the second takes only about 0.1 s.
I am using a V100 GPU.
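To make the numbers concrete, timing two consecutive requests on the same sample (using the hypothetical inference helper from the sketch above) shows the gap:

```python
import time

sample = data[0]  # one of the 10 real pieces of data

t0 = time.perf_counter()
inference(sample)   # first request after server start: ~20 s observed
t1 = time.perf_counter()
inference(sample)   # repeat on the same data: ~0.1 s observed
t2 = time.perf_counter()

print(f"first:  {t1 - t0:.2f} s")
print(f"second: {t2 - t1:.2f} s")
```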
