
📄 Document RAG Assistant

Transform any document into an interactive AI conversation using RAG (Retrieval Augmented Generation)


🎯 Overview

Document RAG Assistant is a Streamlit web application that enables you to have intelligent conversations with your document content. Simply upload a PDF or text file, and the app will process the document to create a searchable knowledge base that you can query using natural language.

🌐 Live Demo

Try it out here: Document RAG Assistant Live Demo

✨ Features

  • πŸ“ Multi-format Support: Process PDF and text files seamlessly
  • πŸ€– Multiple AI Models: Support for various Google Gemini models (2.5 Pro, Flash, 2.0 Flash, etc.)
  • πŸ’¬ Interactive Chat: Natural language conversation with document content
  • πŸ” Smart Search: Vector-based similarity search using FAISS
  • πŸ“Š Session Management: Chat history, export functionality, and session persistence
  • 🎨 Modern UI: Clean, responsive Streamlit interface with real-time updates
  • πŸ“ˆ Progress Tracking: Visual feedback during document processing
  • πŸ”„ Streaming Responses: Real-time AI response streaming with typing indicators
  • πŸ›‘οΈ Fallback System: Automatic HuggingFace embeddings if Google quota exceeded
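The fallback behavior in the last bullet typically follows a simple try/except pattern. Here is a minimal sketch with stand-in factory callables (`make_google` and `make_huggingface` are hypothetical names for illustration, not the app's actual API):

```python
def get_embeddings(make_google, make_huggingface):
    """Try the primary Google embeddings first; if creating them fails
    (e.g. quota exhausted), fall back to local HuggingFace embeddings."""
    try:
        return make_google()
    except Exception:
        return make_huggingface()
```

In the real app, the two factories would construct the Google Generative AI and HuggingFace embedding objects; the pattern is the same either way.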

🛠️ Tech Stack

  • Frontend: Streamlit
  • AI/ML: Google Gemini API, LangChain
  • Vector Store: FAISS (Facebook AI Similarity Search)
  • Embeddings: Google Generative AI Embeddings
  • Document Processing: PyPDFLoader, TextLoader
  • Environment: Python 3.11+, Docker support

📋 Prerequisites

  • Python 3.11 or higher
  • Google Gemini API Key (Get one here)
  • Internet connection for API access

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/ZohaibCodez/document-qa-rag-system.git
cd document-qa-rag-system

2. Install Dependencies

uv sync

3. Set Up Environment

cp .env.example .env
# Edit .env and add your Google API key

4. Run the Application

uv run streamlit run app.py

5. Access the App

Open your browser and navigate to http://localhost:8501

🔧 Configuration

Environment Variables

Create a .env file in the root directory:

GOOGLE_API_KEY=your_google_gemini_api_key_here

Alternatively, you can enter your API key directly in the app's sidebar.

Supported Models

  • gemini-2.5-pro (Most capable, recommended for complex analysis)
  • gemini-2.5-flash (Balanced performance and speed)
  • gemini-2.5-flash-lite (Lightweight and fast)
  • gemini-2.0-flash (Fast responses, good accuracy)
  • gemini-1.5-pro (Reliable baseline)
  • gemini-1.5-flash (Quick processing)

Configurable Parameters

CHUNK_SIZE = 1000          # Text chunk size for processing
CHUNK_OVERLAP = 100        # Overlap between chunks for context
RETRIEVER_K = 4           # Number of similar chunks to retrieve
EMBEDDING_MODEL = "models/gemini-embedding-exp-03-07"
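To make the CHUNK_SIZE/CHUNK_OVERLAP interaction concrete, here is a simplified pure-Python chunker. The real app uses a LangChain text splitter; this sketch only illustrates the sliding-window arithmetic those two parameters control:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each
    chunk repeats the last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, a 2,500-character document yields three chunks, each sharing 100 characters with its neighbor so that context spanning a chunk boundary is not lost.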

📱 How to Use

  1. Enter API Key: Add your Google Gemini API key in the sidebar
  2. Upload Document: Click "📁 Upload your document" and select a PDF or TXT file
  3. Process Document: Click "🚀 Process Document" to extract and index the content
  4. Start Chatting: Ask questions about the document content in natural language
  5. Export Chat: Download your conversation history anytime using the sidebar

Supported File Formats

  • PDF: .pdf files (text-based, not scanned images)
  • Text: .txt files (plain text documents)
  • Size Limit: Up to 100MB (recommended: <10MB for optimal performance)
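Extension-based loader dispatch can be sketched as follows; `pick_loader` is an illustrative helper (the PyPDFLoader/TextLoader names come from the tech stack above), not the app's exact code:

```python
from pathlib import Path

LOADERS = {".pdf": "PyPDFLoader", ".txt": "TextLoader"}

def pick_loader(filename: str) -> str:
    """Return the name of the document loader for a given file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return LOADERS[suffix]
```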

Example Queries

  • "What is the main topic of this document?"
  • "Summarize the key findings in chapter 3"
  • "What does the author say about machine learning?"
  • "List all the recommendations mentioned"
  • "Explain the methodology used in this research"

⚠️ Current Limitations

  • File Types: Currently supports only PDF and TXT formats
  • Language: Optimized for English documents
  • Processing Time: Large documents (>50 pages) may take longer to process
  • API Limits: Subject to Google Gemini API rate limits and quotas
  • Scanned PDFs: Does not support OCR for image-based PDFs

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  File Upload    │───▶│  Text Splitter   │───▶│   Embeddings    │
│  (PDF/TXT)      │    │  (Chunking)      │    │  (Google AI)    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Streamlit UI   │◀───│   Chat Chain     │◀───│   FAISS Store   │
│   (Frontend)    │    │  (LangChain)     │    │ (Vector Search) │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                               │
                       ┌──────────────────┐
                       │  Gemini Models   │
                       │ (Generative AI)  │
                       └──────────────────┘
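The FAISS box in the diagram performs nearest-neighbor search over embedding vectors. A dependency-free toy version of that step (the 2-D vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: tuple, b: tuple) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec: tuple, index: list, k: int = 4) -> list[str]:
    """Return the k chunk texts whose embeddings are most similar to the
    query, mirroring the top-k search (RETRIEVER_K) that FAISS performs."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The retrieved chunks are then stuffed into the prompt that the chat chain sends to the Gemini model.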

🐳 Docker Support

Using Docker Compose (Recommended)

# Create .env file with your API key
echo "GOOGLE_API_KEY=your-api-key-here" > .env

# Start the service
docker-compose up -d

# View logs
docker-compose logs -f

Using Docker directly

# Build image
docker build -t document-qa-rag-system .

# Run container
docker run -p 8501:8501 -e GOOGLE_API_KEY=your_api_key document-qa-rag-system

📁 Project Structure

document-rag-assistant/
│
├── app.py                  # Main Streamlit application
│
├── notebooks/
│   └── rag_demo.ipynb      # Beginner-level RAG notebook demo
│
├── data/                   # Sample documents (PDF/TXT)
│   ├── Stack vs Heap Memory.txt
│   └── FastAPI Modern Python Web Development.pdf
│
├── Dockerfile              # Container setup
├── requirements.txt        # Python dependencies
├── .env.example            # Example API key file
├── .gitignore              # Git ignore rules
└── README.md               # Project documentation

📊 Performance Metrics

  • Processing Speed: ~2-5 seconds for typical documents (10-50 pages)
  • Memory Usage: Optimized vector storage with FAISS
  • Accuracy: Answers grounded in the top 4 most similar chunks (RETRIEVER_K = 4)
  • Container Size: ~380MB (optimized Docker image)
  • Response Time: Sub-second for most queries

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development with Docker

# Build development image
docker build -f Dockerfile.dev -t document-rag-dev .

# Run with live reload
docker run -p 8501:8501 -v $(pwd):/app document-rag-dev

📝 Future Roadmap

  • Support for more document formats (DOCX, HTML, Markdown)
  • Multi-document conversation capabilities
  • OCR support for scanned PDFs
  • Advanced filtering and search options
  • Integration with cloud storage services (Google Drive, Dropbox)
  • API endpoint for programmatic access
  • Batch processing for multiple documents
  • Custom embedding model options
  • Multi-language document support
  • Document summarization features

🐛 Known Issues

  • Large PDF files (>100MB) may cause memory issues
  • Some complex PDF layouts may not parse correctly
  • API rate limiting may affect performance during peak usage
  • Embedded images in PDFs are not processed

🔧 Troubleshooting

Common Issues

"API key not found" error:

  • Ensure your Google Gemini API key is correctly set
  • Check that the key has proper permissions

Document processing fails:

  • Verify the document format is supported (PDF/TXT)
  • Ensure the file is not corrupted or password-protected

Slow processing:

  • Try using a smaller document or different model
  • Check your internet connection

Out of memory:

  • Reduce document size or restart the application
  • For Docker: increase memory limits

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

If you encounter any issues or have questions, please open an issue on the GitHub repository.


⭐ Star this repository if you found it helpful!

Built with 🖤 using Streamlit and Google Gemini AI
