
[Bug]: InternVL3-2B/8B can't run inference with video input when combined with AsyncLLMEngine #25176

@JiancongWang

Description

Your current environment

I am using torch 2.7.1 + vllm 0.10.0 + transformers 4.55.4

🐛 Describe the bug

Hi, I am trying to use InternVL3-2B/8B (both have a Qwen 2.5 text backbone and support video input) with video input. This works out of the box for offline inference (I tested against https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language.py#L594-L630). However, when I try to deploy the model with AsyncLLMEngine, it complains that the supported number of video inputs is 0.
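
For reference, a minimal sketch of the offline setup (adapted from the example script above; the raw "<video>" placeholder prompt is a simplification of the chat-templated prompt the script builds):

import numpy as np
from vllm import LLM, SamplingParams

# Offline inference along these lines handles the video input fine.
llm = LLM(
    model="OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

outputs = llm.generate(
    {
        "prompt": "<video>\nIs this video of good quality? Answer Yes or No only",
        "multi_modal_data": {"video": np.zeros((10, 448, 448, 3), dtype=np.uint8)},
    },
    SamplingParams(temperature=0.0, max_tokens=1),
)
print(outputs[0].outputs[0].text)

Here is my code for the AsyncLLMEngine: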

import numpy as np
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import FlexibleArgumentParser

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    use_fast=False,
)

# "*self.worker_config.args" supplies extra engine args from my worker config
# (this snippet is excerpted from an async method of a worker class).
args = [
    "--model", "OpenGVLab/InternVL3-8B",
    *self.worker_config.args,
    "--max-model-len", "16k",
    "--guided-decoding-backend", "xgrammar",
    "--enable-prefix-caching",
]
args_parser = AsyncEngineArgs.add_cli_args(FlexibleArgumentParser())
parsed_args = args_parser.parse_args(args)

engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
engine_args.limit_mm_per_prompt = {"image": 10, "video": 10}  # allow 10 images / 10 videos per prompt
llm_engine = AsyncLLMEngine.from_engine_args(engine_args)

_question = "Is this video of good quality? Answer Yes or No only"
messages = [{"role": "user", "content": f"<video>\n{_question}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

single_input = {
    "prompt": prompt,
    "multi_modal_data": {
        # Any video data put here is OK; a dummy array reproduces the error.
        "video": np.zeros((10, 448, 448, 3), dtype=np.uint8),
    },
}

sampling_params = SamplingParams(
    temperature=self.temperature,  # from my worker config
    max_tokens=1,
    logprobs=10,
)

request_id = "0000"

# The following runs inside an async method, hence the bare "await".
response_generator = llm_engine.generate(single_input, sampling_params=sampling_params, request_id=request_id)
response = await response_generator.__anext__()
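
For completeness, an equivalent setup that skips CLI parsing and passes limit_mm_per_prompt up front, in case mutating engine_args after from_cli_args() is part of the problem (a sketch only; I have not verified whether this changes the behavior):

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Same configuration built directly, with the multimodal limit set at
# construction time rather than assigned afterwards.
engine_args = AsyncEngineArgs(
    model="OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    max_model_len=16384,
    guided_decoding_backend="xgrammar",
    enable_prefix_caching=True,
    limit_mm_per_prompt={"image": 10, "video": 10},
)
llm_engine = AsyncLLMEngine.from_engine_args(engine_args)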

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
