Description
I use an ensemble with python_backend. After 10 warmup requests with random data, the first inference on a given input is still especially slow; only the second request with the same data reaches the expected latency.
Triton Information
What version of Triton are you using?
24.05-py3
Are you using the Triton container or did you build it yourself?
I built it myself, after installing the necessary Python libraries.
To Reproduce
Steps to reproduce the behavior.
- Start command: tritonserver --model-repository=/models --cuda-memory-pool-byte-size=0:1024000000
- Warmup: inference( np.random.uniform(size=size) ) repeated 10 times
- Run: I use 10 pieces of data and iterate over them in random order: for da in np.random.choice(data, len(data), replace=False): inference(da)
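The steps above can be sketched as follows. Note that np.random.choice raises an error on a list of multi-dimensional arrays, so this sketch shuffles indices with np.random.permutation instead; the inference() function here is a hypothetical placeholder for the actual Triton client call, and the input shape is assumed:

```python
import time
import numpy as np

# Hypothetical placeholder for the real Triton inference request.
def inference(x):
    return x.sum()

size = (1, 3, 224, 224)  # assumed input shape

# 10 warmup requests with random data
for _ in range(10):
    inference(np.random.uniform(size=size).astype(np.float32))

# 10 fixed test samples, visited once each in random order
data = [np.random.uniform(size=size).astype(np.float32) for _ in range(10)]
for i in np.random.permutation(len(data)):
    t0 = time.perf_counter()
    inference(data[i])
    print(f"sample {i}: {time.perf_counter() - t0:.3f}s")
```

Timing each request individually like this makes the first-request slowdown visible per sample.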
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
Expected behavior
A clear and concise description of what you expected to happen.
I found that when the Triton server is freshly started, the first inference on each piece of data takes a long time; after that, running the same 10 inputs again is fast on every one. For example, the first inference on a new input takes about 20 seconds, but the second takes only 0.1 s.
The GPU is a V100.