110 changes: 110 additions & 0 deletions examples/Vstar/README.md
@@ -0,0 +1,110 @@
# VStar Example

VStar is a workflow that leverages visual contextual information to process high-resolution images. It uses hierarchical image structures and adaptive thresholding to efficiently locate the objects relevant to a user query.

This example demonstrates how to use the OmAgent framework for visual search and analysis tasks. The example code can be found in the `examples/Vstar` directory.

```bash
cd examples/Vstar
```

## Overview

This example implements a comprehensive VStar workflow that consists of the following components:

1. **VStar Input**
- Handles user input containing text queries and image uploads
- Processes multi-modal inputs to prepare for visual analysis

2. **VStar Workflow**
   - Determines the visual elements needed to answer the question
   - Performs confidence-guided searches to localize those elements in the image
   - Optimizes search efficiency using adaptive thresholding

### Workflow Structure

<img src="./docs/images/vstar_workflow.jpg" alt="VStar Workflow" width="500" height="auto">

## Prerequisites

- Python 3.11+
- Required packages installed (see requirements.txt)
- Access to a multimodal LLM (e.g., LLaVA, GPT-4V) or compatible endpoint
- Redis server running locally or remotely (for pro mode)
- Conductor server running locally or remotely (for pro mode)

## Configuration

The `container.yaml` file manages dependencies and settings for different components of the system. To set up your configuration:

1. Generate the `container.yaml` file:
```bash
python compile_container.py
```
   This will create a `container.yaml` file with default settings under `examples/Vstar`.

2. Configure your multimodal LLM settings in `configs/llms/*.yml`:
- Set your model endpoint through environment variables or by directly modifying the yml file
```bash
export custom_vstar_endpoint="your_vstar_endpoint"
```

- Configure other model settings like temperature as needed.

3. Update settings in the generated `container.yaml`:
- Modify Redis connection settings (for pro mode):
- Set the host, port, and credentials for your Redis instance.
- Configure both `redis_stream_client` and `redis_stm_client` sections.
- Update the Conductor server URL under the conductor_config section (for pro mode).
- Adjust any other component settings as needed.
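
As a quick sanity check that the configuration is in place, you can run the short sketch below from `examples/Vstar`. It is only a convenience snippet (not part of the example code) and assumes step 1 above has been completed; it prints the endpoint that will be used.

```python
# Quick configuration check: run from examples/Vstar after generating container.yaml.
import os
from pathlib import Path

assert Path("container.yaml").exists(), "Run `python compile_container.py` first."
endpoint = os.environ.get("custom_vstar_endpoint", "https://vstar.om-ai.com")
print(f"container.yaml found; VStar endpoint resolves to: {endpoint}")
```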

## Running the Example

Run the VStar example:

For terminal/CLI usage:
```bash
python run_cli.py
```

You can run the VStar workflow in `pro` mode or `lite` mode by setting the `OMAGENT_MODE` environment variable. The default mode is `pro`, which uses the Conductor and Redis servers; `lite` mode runs the workflow in the current Python process without any external services.

For pro mode:
```bash
export OMAGENT_MODE="pro"
python run_cli.py
```

For lite mode:
```bash
export OMAGENT_MODE="lite"
python run_cli.py
```

## How VStar Works

VStar uses a hierarchical approach to image analysis:

1. The image is first processed to extract relevant features and prepare for analysis.
2. Visual cues are generated from the user's query to guide the search.
3. A confidence-guided search algorithm traverses the image data to locate visual elements.
4. Adaptive thresholding ensures high-quality results while optimizing computation.
5. Found elements are synthesized into a comprehensive answer.

This approach enables precise localization of visual elements while maintaining computational efficiency.
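
For intuition, the sketch below shows one way a confidence-guided, hierarchical search with an adaptive threshold can be organized. It is an illustration only, not the OmAgent/V* implementation: the `score` callable (standing in for a visual-search-model confidence query), the quadrant split, and the decay schedule are assumptions made for the example.

```python
# Illustrative sketch of confidence-guided hierarchical search with an adaptive
# threshold. NOT the OmAgent/V* implementation; `score`, the quadrant split, and
# the decay schedule are simplifications for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


@dataclass
class Region:
    box: Box
    confidence: float  # model confidence that the target lies in this region


def split(box: Box) -> List[Box]:
    """Split a region into four quadrants (the hierarchical step)."""
    x, y, w, h = box
    hw, hh = w // 2, h // 2
    return [(x, y, hw, hh), (x + hw, y, w - hw, hh),
            (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh)]


def confidence_guided_search(score: Callable[[Box], float], image_box: Box,
                             threshold: float = 0.8, min_size: int = 224,
                             decay: float = 0.95) -> Box:
    frontier = [Region(image_box, score(image_box))]
    best = frontier[0]
    while frontier:
        # Always expand the most promising region first.
        frontier.sort(key=lambda r: r.confidence, reverse=True)
        region = frontier.pop(0)
        best = max(best, region, key=lambda r: r.confidence)
        if region.confidence >= threshold:
            return region.box                     # confident enough: stop early
        if min(region.box[2], region.box[3]) <= min_size:
            continue                              # too small to split further
        threshold *= decay                        # adaptive: relax as the search deepens
        frontier.extend(Region(b, score(b)) for b in split(region.box))
    return best.box                               # fall back to the best region seen
```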

## Troubleshooting

If you encounter issues:
- Verify your multimodal LLM endpoint is accessible and working.
- For pro mode, confirm Redis is running and accessible.
- Ensure all dependencies are installed correctly.
- Check for sufficient GPU resources if using local model deployment.
- Review logs for any error messages.
- **Open an issue on GitHub if you can't find a solution; we will do our best to help you out!**

## Local Deployment of VStar

VStar is not yet supported by serving frameworks such as vLLM, so the model has to be deployed locally:

1. Go to the V* code repository and download the [source code](https://github.com/penghao-wu/vstar).
2. Copy the API script `OmAgent/examples/Vstar/docs/files/vstar_api.py` into the V* source folder and update the model paths to point at your downloaded SEAL checkpoints.
3. Run `uvicorn vstar_api:app --host 0.0.0.0 --port 8000` to start the service.
4. Point the workflow at the service: `export custom_vstar_endpoint=http://localhost:8000/`.
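
Once the service is running, you can sanity-check it by calling the `/vqa_llm` endpoint defined in `vstar_api.py`. The snippet below is a minimal client sketch; the image path and question are placeholders.

```python
# Minimal client sketch for the locally deployed V* service (vstar_api.py).
# Assumes the service is running on localhost:8000; the image path and
# question below are placeholders.
import base64
import requests

with open("sample.jpg", "rb") as f:  # any local test image
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8000/vqa_llm",
    json={"prompt": "What is the color of the car?", "image_base64": image_base64},
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```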
70 changes: 70 additions & 0 deletions examples/Vstar/agent/vstar_input/vstar_input.py
@@ -0,0 +1,70 @@
from pathlib import Path

from omagent_core.engine.worker.base import BaseWorker
from omagent_core.utils.logger import logging
from omagent_core.utils.registry import registry

CURRENT_PATH = Path(__file__).parents[0]


@registry.register_worker()
class VstarInput(BaseWorker):
"""
A worker class for handling VStar input processing.
This class is responsible for receiving and processing user input,
including both text queries and image data.
"""

def _run(self, *args, **kwargs):
"""
Main execution method for processing VStar input.

This method handles:
1. Reading user input through the input interface
2. Extracting messages from the input
3. Processing both image and text content

Returns:
dict: A dictionary containing:
- 'query': The text query from the user
- 'image_path': The path to the uploaded image

Raises:
Exception: If any error occurs during input processing
"""
try:
# Request user input through the designated input interface
user_input = self.input.read_input(
workflow_instance_id=self.workflow_instance_id,
input_prompt="Please input your question:",
)

# Extract the message list from user input
messages = user_input["messages"]
# Get the most recent message (last message in the list)
message = messages[-1]

# Initialize variables to store image and text data
image_path = None
text = None

# Iterate through each content item in the message
for each_content in message["content"]:
if each_content["type"] == "image_url":
# If content is an image, store its path
image_path = each_content["data"]
elif each_content["type"] == "text":
# If content is text, store the text query
text = each_content["data"]

# Return a dictionary containing both the text query and image path
return {
"query": text,
"image_path": image_path
}

except Exception as e:
# Log any errors that occur during the input processing
            logging.error(f"Error in VStar input processing: {str(e)}")
# Re-raise the exception for proper error handling upstream
raise
20 changes: 20 additions & 0 deletions examples/Vstar/compile_container.py
@@ -0,0 +1,20 @@
from omagent_core.utils.container import container
from pathlib import Path
from omagent_core.utils.registry import registry


# Load all registered workflow components
registry.import_module()

# Path to this example directory (used as the output location for container.yaml)
CURRENT_PATH = Path(__file__).parents[0]

# Register core workflow components for state management, callbacks and input handling
container.register_stm(stm='SharedMemSTM')
container.register_callback(callback='AppCallback')
container.register_input(input='AppInput')


# Compile container config
container.compile_config(CURRENT_PATH)
2 changes: 2 additions & 0 deletions examples/Vstar/configs/llms/vstar.yml
@@ -0,0 +1,2 @@
name: VStarLLM
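# The endpoint is read from the custom_vstar_endpoint environment variable,
# with the URL below as the default fallback (see the README's Configuration section).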
endpoint: ${env| custom_vstar_endpoint, https://vstar.om-ai.com}
1 change: 1 addition & 0 deletions examples/Vstar/configs/workers/vstar_input.yaml
@@ -0,0 +1 @@
name: VstarInput
16 changes: 16 additions & 0 deletions examples/Vstar/configs/workers/vstar_workflow.yaml
@@ -0,0 +1,16 @@
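# Worker sequence for the VStar workflow, executed in order.
# `${sub|vstar}` is assumed to load the LLM settings from configs/llms/vstar.yml.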
- name: VQA_LLM_Preprocess

- name: VstarLoopCheck

- name: VQA_LLM
llm: ${sub|vstar}

- name: VstarSearchPreprocess

- name: VstarSearch
llm: ${sub|vstar}

- name: VstarSearchCheck

- name: VQA_LLM_Post
llm: ${sub|vstar}
126 changes: 126 additions & 0 deletions examples/Vstar/docs/files/vstar_api.py
@@ -0,0 +1,126 @@
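# FastAPI wrapper around the V* (SEAL) models. Copy this file into the V* source
# folder (https://github.com/penghao-wu/vstar) and adjust the checkpoint paths
# below before starting it with uvicorn (see the README's Local Deployment section).
# Exposed endpoints:
#   POST /vqa_llm             - free-form VQA on a base64-encoded image
#   POST /visual_search_model - visual search model inference (vqa / detection / segmentation)
#   POST /vqa_llm_post        - VQA conditioned on previously localized objects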
import base64
from fastapi import FastAPI
from pydantic import BaseModel
from io import BytesIO
from PIL import Image
import numpy as np
from dataclasses import dataclass
from vstar_bench_eval import VQA_LLM, expand2square, normalize_bbox
from visual_search import VSM
import torch
from typing import List

@dataclass
class Args:
vqa_model_path: str = None
version: str = None
conv_type: str = None
vision_tower: str = None

vqa_args = Args(
vqa_model_path="./seal_vqa_7b", # your path to seal_vqa_7b
conv_type="v1"
)
print(f"Using model path: {vqa_args.vqa_model_path}")
vqa_llm = VQA_LLM(vqa_args)

vsm_args = Args(
version="./seal_vsm_7b", # your path to seal_vsm_7b
vision_tower="./clip-vit-large-patch14" # your path to clip-vit-large-patch14
)
vsm = VSM(vsm_args)

app = FastAPI()

class VQAOutput(BaseModel):
generated_text: str

class VQAPayload(BaseModel):
prompt: str
image_base64: str

@app.post("/vqa_llm")
async def generate_from_base64(data: VQAPayload):
prompt = data.prompt
image_base64 = data.image_base64
image_data = base64.b64decode(image_base64)
image = Image.open(BytesIO(image_data))
response = vqa_llm.free_form_inference(image, prompt)

return VQAOutput(
generated_text=response,
)

class VSMOutput(BaseModel):
response: object

class VSMPayload(BaseModel):
prompt: str
image_base64: str
mode: str

@app.post("/visual_search_model")
async def generate_vsm(data: VSMPayload):
prompt = data.prompt
image_base64 = data.image_base64
image_data = base64.b64decode(image_base64)
image = Image.open(BytesIO(image_data))
mode = data.mode
response = vsm.inference(image, prompt, mode)
print(response)
if mode == 'segmentation':
response = response.cpu().tolist()
elif mode == 'vqa':
pass
elif mode == 'detection':
response = [r.cpu().tolist() for r in response]
return VSMOutput(
response=response,
)

class VQAPostPayload(BaseModel):
prompt: str
image_base64: str
bboxes: List[List[int]]
object_names: List[str]

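# /vqa_llm_post: answers the question using previously localized objects.
# Crops each bounding box, pads the image to a square, normalizes the boxes,
# and prepends a "focus" message listing the objects before querying the VQA LLM.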
@app.post("/vqa_llm_post")
async def generate_vqa_post(data: VQAPostPayload):
prompt = data.prompt
image_base64 = data.image_base64
bboxes = data.bboxes
object_names = data.object_names
image_data = base64.b64decode(image_base64)
image = Image.open(BytesIO(image_data))

if len(object_names) <= 2:
images_long = [False]
objects_long = [True]*len(object_names)
else:
images_long = [False]
objects_long = [False]*len(object_names)
object_crops = []
for bbox in bboxes:
object_crop = vqa_llm.get_object_crop(image, bbox, patch_scale=1.2)
object_crops.append(object_crop)
object_crops = torch.stack(object_crops, 0)
image, left, top = expand2square(image, tuple(int(x*255) for x in vqa_llm.image_processor.image_mean))
bbox_list = []
for bbox in bboxes:
bbox[0] += left
bbox[1] += top
bbox_list.append(bbox)
bbox_list = [normalize_bbox(bbox, image.width, image.height) for bbox in bbox_list]
focus_msg = "Additional visual information to focus on: "
cur_focus_msg = focus_msg
for i, (object_name, bbox) in enumerate(zip(object_names, bbox_list)):
cur_focus_msg = cur_focus_msg + "{} <object> at location [{:.3f},{:.3f},{:.3f},{:.3f}]".format(object_name, bbox[0], bbox[1], bbox[2], bbox[3])
if i != len(bbox_list)-1:
cur_focus_msg = cur_focus_msg+"; "
else:
cur_focus_msg = cur_focus_msg +'.'
question_with_focus = cur_focus_msg+"\n"+prompt

response = vqa_llm.free_form_inference(image, question_with_focus, object_crops=object_crops, objects_long=objects_long, images_long=images_long)
return VQAOutput(
generated_text=response,
)
3 changes: 3 additions & 0 deletions examples/Vstar/docs/images/vstar_workflow.jpg