
Multi-Modal RAG (Images + Text) #5

@DhruvGoyal375

Description

Create a retrieval-augmented generation (RAG) system that retrieves relevant text and images in response to a user query. The system should support both multi-modal retrieval and multi-modal generation.

Features

  • Process both text and images for retrieval.
  • Use separate embeddings for text and images.
  • Implement multi-modal query handling.
  • Generate responses combining both modalities.
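
A minimal sketch of the first three features, assuming the embeddings are already computed. In a real build they would come from a text model (e.g. SBERT) and an image model (e.g. CLIP). Here they are toy vectors, and the brute-force `VectorIndex` is a hypothetical stand-in for FAISS, Weaviate, or Pinecone. Text and images live in separate indexes, which may even have different dimensionality. A query is embedded once per index and searched against each. `build_context` is a hypothetical helper showing one way to hand both kinds of hits to the generator.

```python
import math
from dataclasses import dataclass, field

Vector = list[float]


def normalize(v: Vector) -> Vector:
    """Scale to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


@dataclass
class VectorIndex:
    """Hypothetical brute-force cosine index (stand-in for a real vector DB)."""
    ids: list[str] = field(default_factory=list)
    vectors: list[Vector] = field(default_factory=list)

    def add(self, item_id: str, vector: Vector) -> None:
        self.ids.append(item_id)
        self.vectors.append(normalize(vector))

    def search(self, query: Vector, k: int) -> list[tuple[str, float]]:
        q = normalize(query)
        scored = [(i, sum(a * b for a, b in zip(q, v)))
                  for i, v in zip(self.ids, self.vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


@dataclass
class MultiModalRetriever:
    """Keeps text and image embeddings in separate indexes."""
    text_index: VectorIndex = field(default_factory=VectorIndex)
    image_index: VectorIndex = field(default_factory=VectorIndex)

    def retrieve(self, text_query_vec: Vector, image_query_vec: Vector,
                 k: int = 2) -> dict[str, list[tuple[str, float]]]:
        # The same user query is embedded once per modality's model (assumed done by the caller).
        return {
            "text": self.text_index.search(text_query_vec, k),
            "images": self.image_index.search(image_query_vec, k),
        }


def build_context(hits, passages: dict[str, str], captions: dict[str, str]) -> str:
    """Hypothetical helper: merge text and image hits into one generation prompt."""
    lines = ["Relevant text:"]
    lines += [f"- {passages[i]}" for i, _ in hits["text"]]
    lines.append("Relevant images:")
    lines += [f"- [{i}] {captions[i]}" for i, _ in hits["images"]]
    return "\n".join(lines)


# Toy data: 3-d "text" vectors and 2-d "image" vectors (separate embedding spaces).
retriever = MultiModalRetriever()
retriever.text_index.add("t1", [1.0, 0.0, 0.0])
retriever.text_index.add("t2", [0.0, 1.0, 0.0])
retriever.image_index.add("img1", [0.0, 1.0])
retriever.image_index.add("img2", [1.0, 0.0])

hits = retriever.retrieve(text_query_vec=[0.9, 0.1, 0.0],
                          image_query_vec=[0.2, 0.8], k=1)
context = build_context(
    hits,
    passages={"t1": "Passage about topic A.", "t2": "Passage about topic B."},
    captions={"img1": "Diagram of topic A.", "img2": "Photo of topic B."},
)
print(context)
```

The resulting `context` string, together with the image files for the retrieved IDs, is what a vision-capable generator would receive.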

Tech Stack (suggestions; not limited to these)

  • Image Processing: CLIP, OpenAI Vision, Google Vision API
  • Text Embedding Models: OpenAI, SBERT, Cohere
  • Vector Database: FAISS, Weaviate, Pinecone
  • Backend: Python (FastAPI, Flask)
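
If text and images are embedded with different models (for example SBERT for text and CLIP for images), their similarity scores come from different models and generally are not on comparable scales. Sorting the two result lists by raw score is therefore unreliable. One common option, not prescribed by this issue, is reciprocal rank fusion (RRF): it merges lists using only rank positions. The sketch below assumes results are keyed by a shared ID (hypothetically, a document page that has both text and an image), and uses `k=60`, a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each appearance adds 1 / (k + rank), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; items found by both retrievers rise to the top.
    return sorted(scores, key=lambda item_id: scores[item_id], reverse=True)


text_ranking = ["page3", "page7", "page1"]   # e.g. from the text index
image_ranking = ["page7", "page2"]           # e.g. from the image index
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
print(fused)
```

Here `page7` ranks first because both retrievers returned it, even though neither ranked it first on its own.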

Resources
