
[Bug]: InternVL3-2B/8B can't run inference with video input when combined with AsyncLLMEngine #25176

@JiancongWang

Description

Your current environment

I am using torch 2.7.1 + vllm 0.10.0 + transformers 4.55.4

🐛 Describe the bug

Hi, I am trying to use InternVL3-2B/8B (both have a Qwen 2.5 text backbone and support video input) with video input. This works out of the box for offline inference (I tested against https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language.py#L594-L630). However, when I try to deploy the model with AsyncLLMEngine, it complains that the supported number of video inputs is 0.
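
For reference, a minimal sketch of the offline setup (adapted from the example script above; the raw "<video>" placeholder prompt is a simplification of the chat-templated prompt the script builds):

import numpy as np
from vllm import LLM, SamplingParams

# Offline inference along these lines handles the video input fine.
llm = LLM(
    model="OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

outputs = llm.generate(
    {
        "prompt": "<video>\nIs this video of good quality? Answer Yes or No only",
        "multi_modal_data": {"video": np.zeros((10, 448, 448, 3), dtype=np.uint8)},
    },
    SamplingParams(temperature=0.0, max_tokens=1),
)
print(outputs[0].outputs[0].text)

Here is my code for the AsyncLLMEngine: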

import numpy as np
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import FlexibleArgumentParser

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    use_fast=False,
)

# "*self.worker_config.args" supplies extra engine args from my worker config
# (this snippet is excerpted from an async method of a worker class).
args = [
    "--model", "OpenGVLab/InternVL3-8B",
    *self.worker_config.args,
    "--max-model-len", "16k",
    "--guided-decoding-backend", "xgrammar",
    "--enable-prefix-caching",
]
args_parser = AsyncEngineArgs.add_cli_args(FlexibleArgumentParser())
parsed_args = args_parser.parse_args(args)

engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
engine_args.limit_mm_per_prompt = {"image": 10, "video": 10}  # allow 10 images / 10 videos per prompt
llm_engine = AsyncLLMEngine.from_engine_args(engine_args)

_question = "Is this video of good quality? Answer Yes or No only"
messages = [{"role": "user", "content": f"<video>\n{_question}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

single_input = {
    "prompt": prompt,
    "multi_modal_data": {
        # Any video data put here is OK; a dummy array reproduces the error.
        "video": np.zeros((10, 448, 448, 3), dtype=np.uint8),
    },
}

sampling_params = SamplingParams(
    temperature=self.temperature,  # from my worker config
    max_tokens=1,
    logprobs=10,
)

request_id = "0000"

# The following runs inside an async method, hence the bare "await".
response_generator = llm_engine.generate(single_input, sampling_params=sampling_params, request_id=request_id)
response = await response_generator.__anext__()
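
For completeness, an equivalent setup that skips CLI parsing and passes limit_mm_per_prompt up front, in case mutating engine_args after from_cli_args() is part of the problem (a sketch only; I have not verified whether this changes the behavior):

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Same configuration built directly, with the multimodal limit set at
# construction time rather than assigned afterwards.
engine_args = AsyncEngineArgs(
    model="OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    max_model_len=16384,
    guided_decoding_backend="xgrammar",
    enable_prefix_caching=True,
    limit_mm_per_prompt={"image": 10, "video": 10},
)
llm_engine = AsyncLLMEngine.from_engine_args(engine_args)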

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
