Your current environment
I am using torch 2.7.1, vLLM 0.10.0, and transformers 4.55.4.
🐛 Describe the bug
Hi, I am trying to use InternVL3-2B/8B (both have a Qwen 2.5 text backbone and support video input) with video input. That works out of the box for offline inference; I am testing with https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/vision_language.py#L594-L630, and a minimal sketch of that working path is below. However, when I try to deploy the model with AsyncLLMEngine, it complains that the supported number of video inputs is 0.
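Roughly what the working offline path looks like (a sketch adapted from the linked example; the inline `<video>` prompt is a simplification, since the real example builds the prompt through the HF chat template):

```python
# Sketch of the offline path that works, adapted from the linked example.
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(
    model="OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    max_model_len=16384,
    # limits passed at construction time, not set after parsing
    limit_mm_per_prompt={"image": 10, "video": 10},
)

question = "Is this video of good quality? Answer Yes or No only"
prompt = f"<video>\n{question}"  # simplified; the example uses the chat template

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {
            "video": np.zeros((10, 448, 448, 3), dtype=np.uint8),
        },
    },
    SamplingParams(temperature=0.0, max_tokens=1),
)
print(outputs[0].outputs[0].text)
```

And here is my code for the AsyncLLMEngine, which fails: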
```python
import numpy as np
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import FlexibleArgumentParser

tokenizer = AutoTokenizer.from_pretrained(
    "OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    use_fast=False,
)

# self.worker_config.args holds extra CLI flags from my serving wrapper class
args = [
    "--model", "OpenGVLab/InternVL3-8B",
    *self.worker_config.args,
    "--max-model-len", "16k",
    "--guided-decoding-backend", "xgrammar",
    "--enable-prefix-caching",
]
args_parser = AsyncEngineArgs.add_cli_args(FlexibleArgumentParser())
parsed_args = args_parser.parse_args(args)
engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
# set the multimodal limits after parsing the CLI args
engine_args.limit_mm_per_prompt = {"image": 10, "video": 10}
llm_engine = AsyncLLMEngine.from_engine_args(engine_args)

_question = "Is this video of good quality? Answer Yes or No only"
messages = [{"role": "user", "content": f"<video>\n{_question}"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

single_input = {
    "prompt": prompt,
    "multi_modal_data": {
        # any video data put here is ok
        "video": np.zeros((10, 448, 448, 3), dtype=np.uint8),
    },
}
sampling_params = SamplingParams(
    temperature=self.temperature,  # from my serving wrapper class
    max_tokens=1,
    logprobs=10,
)
request_id = "0000"
# this runs inside an async function
response_generator = llm_engine.generate(
    single_input, sampling_params=sampling_params, request_id=request_id
)
response = await response_generator.__anext__()
```
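For reference, I would expect the same limits to be expressible directly as a CLI flag instead of mutating engine_args after parsing (assuming this vLLM version accepts the JSON form of --limit-mm-per-prompt):

```python
# Hypothetical alternative: pass the limits through the parsed CLI args,
# assuming the JSON form of --limit-mm-per-prompt is accepted in 0.10.0.
args = [
    "--model", "OpenGVLab/InternVL3-8B",
    "--max-model-len", "16k",
    "--limit-mm-per-prompt", '{"image": 10, "video": 10}',
]
```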