110 changes: 110 additions & 0 deletions examples/Vstar/README.md
@@ -0,0 +1,110 @@
# VStar Example

VStar is a workflow that leverages visual contextual information to process high-resolution images. It uses hierarchical image structures and adaptive thresholding to efficiently locate the objects relevant to a user query.

This example demonstrates how to use the OmAgent framework for visual search and analysis tasks. The example code can be found in the `examples/Vstar` directory.

```bash
cd examples/Vstar
```

## Overview

This example implements a comprehensive VStar workflow that consists of the following components:

1. **VStar Input**
- Handles user input containing text queries and image uploads
- Processes multi-modal inputs to prepare for visual analysis

2. **VStar Workflow**
   - Determines the visual elements needed to answer the question
   - Performs confidence-guided searches to localize those elements in the image
   - Optimizes search efficiency using adaptive thresholding

### Workflow Structure

<img src="./docs/images/vstar_workflow.jpg" alt="VStar Workflow" width="500" height="auto">

## Prerequisites

- Python 3.11+
- Required packages installed (see requirements.txt)
- Access to a multimodal LLM (e.g., LLaVA, GPT-4V) or compatible endpoint
- Redis server running locally or remotely (for pro mode)
- Conductor server running locally or remotely (for pro mode)

## Configuration

The `container.yaml` file manages dependencies and settings for different components of the system. To set up your configuration:

1. Generate the `container.yaml` file:
```bash
python compile_container.py
```
   This will create a `container.yaml` file with default settings under `examples/Vstar`.

2. Configure your multimodal LLM settings in `configs/llms/*.yml`:
- Set your model endpoint through environment variables or by directly modifying the yml file
```bash
export custom_vstar_endpoint="your_vstar_endpoint"
```

- Configure other model settings like temperature as needed.

3. Update settings in the generated `container.yaml`:
- Modify Redis connection settings (for pro mode):
- Set the host, port, and credentials for your Redis instance.
- Configure both `redis_stream_client` and `redis_stm_client` sections.
- Update the Conductor server URL under the conductor_config section (for pro mode).
- Adjust any other component settings as needed.
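
As a quick sanity check that the configuration is in place, you can run the short sketch below from `examples/Vstar`. It is only a convenience snippet (not part of the example code) and assumes step 1 above has been completed; it prints the endpoint that will be used.

```python
# Quick configuration check: run from examples/Vstar after generating container.yaml.
import os
from pathlib import Path

assert Path("container.yaml").exists(), "Run `python compile_container.py` first."
endpoint = os.environ.get("custom_vstar_endpoint", "https://vstar.om-ai.com")
print(f"container.yaml found; VStar endpoint resolves to: {endpoint}")
```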

## Running the Example

Run the VStar example:

For terminal/CLI usage:
```bash
python run_cli.py
```

You can run the VStar workflow in `pro` mode or `lite` mode by setting the `OMAGENT_MODE` environment variable. The default mode is `pro`, which uses the Conductor and Redis servers; `lite` mode runs the workflow in the current Python process without any external services.

For pro mode:
```bash
export OMAGENT_MODE="pro"
python run_cli.py
```

For lite mode:
```bash
export OMAGENT_MODE="lite"
python run_cli.py
```

## How VStar Works

VStar uses a hierarchical approach to image analysis:

1. The image is first processed to extract relevant features and prepare for analysis.
2. Visual cues are generated from the user's query to guide the search.
3. A confidence-guided search algorithm traverses the image data to locate visual elements.
4. Adaptive thresholding ensures high-quality results while optimizing computation.
5. Found elements are synthesized into a comprehensive answer.

This approach enables precise localization of visual elements while maintaining computational efficiency.
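
For intuition, the sketch below shows one way a confidence-guided, hierarchical search with an adaptive threshold can be organized. It is an illustration only, not the OmAgent/V* implementation: the `score` callable (standing in for a visual-search-model confidence query), the quadrant split, and the decay schedule are assumptions made for the example.

```python
# Illustrative sketch of confidence-guided hierarchical search with an adaptive
# threshold. NOT the OmAgent/V* implementation; `score`, the quadrant split, and
# the decay schedule are simplifications for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


@dataclass
class Region:
    box: Box
    confidence: float  # model confidence that the target lies in this region


def split(box: Box) -> List[Box]:
    """Split a region into four quadrants (the hierarchical step)."""
    x, y, w, h = box
    hw, hh = w // 2, h // 2
    return [(x, y, hw, hh), (x + hw, y, w - hw, hh),
            (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh)]


def confidence_guided_search(score: Callable[[Box], float], image_box: Box,
                             threshold: float = 0.8, min_size: int = 224,
                             decay: float = 0.95) -> Box:
    frontier = [Region(image_box, score(image_box))]
    best = frontier[0]
    while frontier:
        # Always expand the most promising region first.
        frontier.sort(key=lambda r: r.confidence, reverse=True)
        region = frontier.pop(0)
        best = max(best, region, key=lambda r: r.confidence)
        if region.confidence >= threshold:
            return region.box                     # confident enough: stop early
        if min(region.box[2], region.box[3]) <= min_size:
            continue                              # too small to split further
        threshold *= decay                        # adaptive: relax as the search deepens
        frontier.extend(Region(b, score(b)) for b in split(region.box))
    return best.box                               # fall back to the best region seen
```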

## Troubleshooting

If you encounter issues:
- Verify your multimodal LLM endpoint is accessible and working.
- For pro mode, confirm Redis is running and accessible.
- Ensure all dependencies are installed correctly.
- Check for sufficient GPU resources if using local model deployment.
- Review logs for any error messages.
- **Open an issue on GitHub if you can't find a solution; we will do our best to help you out!**

## Local Deployment of VStar

VStar is not yet supported by serving frameworks such as vLLM, so the model has to be deployed locally:

1. Go to the V* code repository and download the [source code](https://github.com/penghao-wu/vstar).
2. Copy the API script `OmAgent/examples/Vstar/docs/files/vstar_api.py` into the V* source folder and update the model paths to point at your downloaded SEAL checkpoints.
3. Run `uvicorn vstar_api:app --host 0.0.0.0 --port 8000` to start the service.
4. Point the workflow at the service: `export custom_vstar_endpoint=http://localhost:8000/`.
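
Once the service is running, you can sanity-check it by calling the `/vqa_llm` endpoint defined in `vstar_api.py`. The snippet below is a minimal client sketch; the image path and question are placeholders.

```python
# Minimal client sketch for the locally deployed V* service (vstar_api.py).
# Assumes the service is running on localhost:8000; the image path and
# question below are placeholders.
import base64
import requests

with open("sample.jpg", "rb") as f:  # any local test image
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8000/vqa_llm",
    json={"prompt": "What is the color of the car?", "image_base64": image_base64},
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```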
70 changes: 70 additions & 0 deletions examples/Vstar/agent/vstar_input/vstar_input.py
@@ -0,0 +1,70 @@
from pathlib import Path

from omagent_core.engine.worker.base import BaseWorker
from omagent_core.utils.logger import logging
from omagent_core.utils.registry import registry

CURRENT_PATH = Path(__file__).parents[0]


@registry.register_worker()
class VstarInput(BaseWorker):
"""
A worker class for handling VStar input processing.
This class is responsible for receiving and processing user input,
including both text queries and image data.
"""

def _run(self, *args, **kwargs):
"""
Main execution method for processing VStar input.

This method handles:
1. Reading user input through the input interface
2. Extracting messages from the input
3. Processing both image and text content

Returns:
dict: A dictionary containing:
- 'query': The text query from the user
- 'image_path': The path to the uploaded image

Raises:
Exception: If any error occurs during input processing
"""
try:
# Request user input through the designated input interface
user_input = self.input.read_input(
workflow_instance_id=self.workflow_instance_id,
input_prompt="Please input your question:",
)

# Extract the message list from user input
messages = user_input["messages"]
# Get the most recent message (last message in the list)
message = messages[-1]

# Initialize variables to store image and text data
image_path = None
text = None

# Iterate through each content item in the message
for each_content in message["content"]:
if each_content["type"] == "image_url":
# If content is an image, store its path
image_path = each_content["data"]
elif each_content["type"] == "text":
# If content is text, store the text query
text = each_content["data"]

# Return a dictionary containing both the text query and image path
return {
"query": text,
"image_path": image_path
}

except Exception as e:
# Log any errors that occur during the input processing
            logging.error(f"Error in VStar input processing: {str(e)}")
# Re-raise the exception for proper error handling upstream
raise
20 changes: 20 additions & 0 deletions examples/Vstar/compile_container.py
@@ -0,0 +1,20 @@
from omagent_core.utils.container import container
from pathlib import Path
from omagent_core.utils.registry import registry


# Load all registered workflow components
registry.import_module()

# Path to this example directory (used as the output location for container.yaml)
CURRENT_PATH = Path(__file__).parents[0]

# Register core workflow components for state management, callbacks and input handling
container.register_stm(stm='SharedMemSTM')
container.register_callback(callback='AppCallback')
container.register_input(input='AppInput')


# Compile container config
container.compile_config(CURRENT_PATH)
2 changes: 2 additions & 0 deletions examples/Vstar/configs/llms/vstar.yml
@@ -0,0 +1,2 @@
name: VStarLLM
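# The endpoint is read from the custom_vstar_endpoint environment variable,
# with the URL below as the default fallback (see the README's Configuration section).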
endpoint: ${env| custom_vstar_endpoint, https://vstar.om-ai.com}
1 change: 1 addition & 0 deletions examples/Vstar/configs/workers/vstar_input.yaml
@@ -0,0 +1 @@
name: VstarInput
16 changes: 16 additions & 0 deletions examples/Vstar/configs/workers/vstar_workflow.yaml
@@ -0,0 +1,16 @@
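# Worker sequence for the VStar workflow, executed in order.
# `${sub|vstar}` is assumed to load the LLM settings from configs/llms/vstar.yml.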
- name: VQA_LLM_Preprocess

- name: VstarLoopCheck

- name: VQA_LLM
llm: ${sub|vstar}

- name: VstarSearchPreprocess

- name: VstarSearch
llm: ${sub|vstar}

- name: VstarSearchCheck

- name: VQA_LLM_Post
llm: ${sub|vstar}
126 changes: 126 additions & 0 deletions examples/Vstar/docs/files/vstar_api.py
@@ -0,0 +1,126 @@
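# FastAPI wrapper around the V* (SEAL) models. Copy this file into the V* source
# folder (https://github.com/penghao-wu/vstar) and adjust the checkpoint paths
# below before starting it with uvicorn (see the README's Local Deployment section).
# Exposed endpoints:
#   POST /vqa_llm             - free-form VQA on a base64-encoded image
#   POST /visual_search_model - visual search model inference (vqa / detection / segmentation)
#   POST /vqa_llm_post        - VQA conditioned on previously localized objects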
import base64
from fastapi import FastAPI
from pydantic import BaseModel
from io import BytesIO
from PIL import Image
import numpy as np
from dataclasses import dataclass
from vstar_bench_eval import VQA_LLM, expand2square, normalize_bbox
from visual_search import VSM
import torch
from typing import List

@dataclass
class Args:
vqa_model_path: str = None
version: str = None
conv_type: str = None
vision_tower: str = None

vqa_args = Args(
vqa_model_path="./seal_vqa_7b", # your path to seal_vqa_7b
conv_type="v1"
)
print(f"Using model path: {vqa_args.vqa_model_path}")
vqa_llm = VQA_LLM(vqa_args)

vsm_args = Args(
version="./seal_vsm_7b", # your path to seal_vsm_7b
vision_tower="./clip-vit-large-patch14" # your path to clip-vit-large-patch14
)
vsm = VSM(vsm_args)

app = FastAPI()

class VQAOutput(BaseModel):
generated_text: str

class VQAPayload(BaseModel):
prompt: str
image_base64: str

@app.post("/vqa_llm")
async def generate_from_base64(data: VQAPayload):
prompt = data.prompt
image_base64 = data.image_base64
image_data = base64.b64decode(image_base64)
image = Image.open(BytesIO(image_data))
response = vqa_llm.free_form_inference(image, prompt)

return VQAOutput(
generated_text=response,
)

class VSMOutput(BaseModel):
response: object

class VSMPayload(BaseModel):
prompt: str
image_base64: str
mode: str

@app.post("/visual_search_model")
async def generate_vsm(data: VSMPayload):
prompt = data.prompt
image_base64 = data.image_base64
image_data = base64.b64decode(image_base64)
image = Image.open(BytesIO(image_data))
mode = data.mode
response = vsm.inference(image, prompt, mode)
print(response)
if mode == 'segmentation':
response = response.cpu().tolist()
elif mode == 'vqa':
pass
elif mode == 'detection':
response = [r.cpu().tolist() for r in response]
return VSMOutput(
response=response,
)

class VQAPostPayload(BaseModel):
prompt: str
image_base64: str
bboxes: List[List[int]]
object_names: List[str]

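# /vqa_llm_post: answers the question using previously localized objects.
# Crops each bounding box, pads the image to a square, normalizes the boxes,
# and prepends a "focus" message listing the objects before querying the VQA LLM.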
@app.post("/vqa_llm_post")
async def generate_vqa_post(data: VQAPostPayload):
prompt = data.prompt
image_base64 = data.image_base64
bboxes = data.bboxes
object_names = data.object_names
image_data = base64.b64decode(image_base64)
image = Image.open(BytesIO(image_data))

if len(object_names) <= 2:
images_long = [False]
objects_long = [True]*len(object_names)
else:
images_long = [False]
objects_long = [False]*len(object_names)
object_crops = []
for bbox in bboxes:
object_crop = vqa_llm.get_object_crop(image, bbox, patch_scale=1.2)
object_crops.append(object_crop)
object_crops = torch.stack(object_crops, 0)
image, left, top = expand2square(image, tuple(int(x*255) for x in vqa_llm.image_processor.image_mean))
bbox_list = []
for bbox in bboxes:
bbox[0] += left
bbox[1] += top
bbox_list.append(bbox)
bbox_list = [normalize_bbox(bbox, image.width, image.height) for bbox in bbox_list]
focus_msg = "Additional visual information to focus on: "
cur_focus_msg = focus_msg
for i, (object_name, bbox) in enumerate(zip(object_names, bbox_list)):
cur_focus_msg = cur_focus_msg + "{} <object> at location [{:.3f},{:.3f},{:.3f},{:.3f}]".format(object_name, bbox[0], bbox[1], bbox[2], bbox[3])
if i != len(bbox_list)-1:
cur_focus_msg = cur_focus_msg+"; "
else:
cur_focus_msg = cur_focus_msg +'.'
question_with_focus = cur_focus_msg+"\n"+prompt

response = vqa_llm.free_form_inference(image, question_with_focus, object_crops=object_crops, objects_long=objects_long, images_long=images_long)
return VQAOutput(
generated_text=response,
)
3 changes: 3 additions & 0 deletions examples/Vstar/docs/images/vstar_workflow.jpg