
Codebase QA: Chat with Your Codebase


Overview

Codebase QA is an interactive Streamlit-based tool that allows you to chat with your project's codebase using Retrieval-Augmented Generation (RAG) powered by local large language models (LLMs) like those from Ollama. It analyzes your project files, builds a vector database for semantic search, and enables natural language queries about your code—such as understanding business logic, impact analysis, UI flows, or technical details.

The tool supports various project types (e.g., Android, iOS, Python, JavaScript/TypeScript) and intelligently chunks code with meaningful metadata for better context retrieval. It uses Chroma as the vector database and focuses on incremental updates, semantic awareness, and efficient querying.

This architecture reduces custom code by leveraging established libraries, making it maintainable and performant. It evolved from a more complex 15-file system to a streamlined setup with 2-3 core files, while providing advanced features like hierarchical indexing and query intent classification.
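
For orientation, the core of such a pipeline can be sketched in a few lines. This is only a minimal sketch, assuming LangChain's community integrations for Ollama and Chroma; the model name, example chunk, and retriever settings are illustrative, and the actual wiring lives in rag_manager.py and build_rag.py:

    from langchain_core.documents import Document
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.llms import Ollama
    from langchain_community.vectorstores import Chroma
    from langchain.chains import RetrievalQA

    # Code chunks produced by the chunker; real chunks carry much richer metadata.
    chunks = [Document(page_content="def login(): ...", metadata={"source": "auth.py"})]

    # Embed chunks with a local Ollama model and persist them in Chroma.
    vector_db = Chroma.from_documents(
        chunks, OllamaEmbeddings(model="llama3"), persist_directory="./vector_db"
    )

    # Answer questions by retrieving relevant chunks and passing them to the local LLM.
    qa = RetrievalQA.from_chain_type(
        llm=Ollama(model="llama3"),
        retriever=vector_db.as_retriever(search_kwargs={"k": 8}),
        return_source_documents=True,
    )
    result = qa.invoke({"query": "What is the main business logic in this app?"})
    print(result["result"], result["source_documents"])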

A preview with an Android project selected is included as a screenshot in the repository.

Key Features

  • Project Type Detection: Automatically detects or lets you select project types (e.g., Android, Python, JavaScript) and processes relevant file extensions.
  • RAG Index Building: Scans project directories, chunks code semantically (e.g., by classes, functions, imports), adds rich metadata (dependencies, complexity scores, UI elements), and stores the results in a Chroma vector database.
  • Local LLM Integration: Uses Ollama for embeddings and query processing, keeping everything local and private.
  • Chat Interface: Ask questions about your codebase with context-aware responses, including source attribution, impact analysis, and debug tools.
  • Incremental Updates: Tracks file changes via Git (if available) or custom hashing, reindexing only modified files for efficiency.
  • Advanced Retrieval: Supports query intent classification (e.g., overview, business logic, UI flow), hierarchical indexing, and relevance scoring.
  • Debugging Tools: Inspect vector DB, test chunking/embeddings/retrieval, view project structure, and force rebuilds.
  • Metadata-Rich Chunks: Extracts dependencies, API endpoints, business logic indicators, validations, and more for precise context building (a minimal chunking sketch follows this list).
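
As a rough illustration of the chunking and metadata ideas above, here is a minimal sketch using LangChain's language-aware splitter. The real logic in chunker_factory.py and metadata_extractor.py is considerably richer; the chunk sizes and metadata fields shown here are only examples:

    from pathlib import Path
    from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON,   # prefer class/function boundaries when splitting
        chunk_size=800,
        chunk_overlap=100,          # overlap keeps context across chunk edges
    )

    path = Path("app.py")
    chunks = splitter.create_documents([path.read_text()], metadatas=[{"source": str(path)}])

    # Attach simple metadata so retrieval can favour the right kind of chunk.
    for chunk in chunks:
        text = chunk.page_content
        chunk.metadata["functions"] = text.count("def ")
        chunk.metadata["classes"] = text.count("class ")
        chunk.metadata["has_imports"] = "import " in text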

Benefits of the New Architecture

  • 80% Less Custom Code: Leverages libraries like LangChain, Chroma, and Ollama for core RAG operations.
  • Language-Aware Chunking: Automatic splitting with semantic awareness (e.g., functions, classes, imports) and overlap for better context.
  • Metadata Extraction: Captures imports, functions, classes, dependencies, UI elements, and complexity metrics.
  • Context-Aware Chat: Intent classification, query rewriting, impact analysis, and reranked sources for relevant answers.
  • Incremental Indexing: Only processes changed files, using Git integration or fallback hashing (a hashing sketch follows this list).
  • Semantic Search: Relevance scoring and hierarchical indexes for multi-level querying.
  • Source Attribution: Tracks and displays exact sources in responses.
  • Robust Error Handling: Graceful management of processing errors.
  • Optimized Performance: Efficient vector operations and token management.
  • Easy Maintenance: Built on actively maintained open-source libraries.
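
For the hashing fallback mentioned above, a minimal sketch could look like the following; the function name is illustrative, and the real tracker in git_hash_tracker.py also handles Git metadata and writes the git_tracking.json file listed under Project Structure:

    import hashlib
    import json
    from pathlib import Path

    TRACKING_FILE = Path("git_tracking.json")

    def changed_files(project_dir, extensions=(".py",)):
        """Return files whose content hash differs from the previous run."""
        previous = json.loads(TRACKING_FILE.read_text()) if TRACKING_FILE.exists() else {}
        current, changed = {}, []
        for path in Path(project_dir).rglob("*"):
            if path.is_file() and path.suffix in extensions:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                current[str(path)] = digest
                if previous.get(str(path)) != digest:
                    changed.append(path)
        TRACKING_FILE.write_text(json.dumps(current, indent=2))
        return changed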

Installation

  1. Prerequisites:

    • Python 3.8 or higher.
    • Ollama installed and running locally (download from ollama.ai).
    • Git (optional, for advanced file tracking).
  2. Clone the Repository:

    git clone https://github.com/anilbattini/codebase-qa.git
    cd codebase-qa
  3. Install Dependencies:

    pip install -r requirements.txt
  4. Run Ollama: Start Ollama and pull a model (e.g., ollama pull llama3).

Usage

  1. Launch the App:

    streamlit run app.py
  2. Configure in the Sidebar:

    • Select your project directory.
    • Choose the project type (e.g., Python, JavaScript).
    • Pick a local Ollama model and endpoint.
    • Toggle force rebuild or debug mode if needed.
  3. Build the Index:

    • Click "Rebuild Index" to process files and create the vector database (stored in ./vector_db/).
    • The app logs progress, showing processed files, chunks, and stats.
  4. Chat with Your Codebase:

    • Once ready, enter queries like:
      • "What is the main business logic in this app?"
      • "What happens if I change file X?"
      • "Explain the UI flow for the login screen."
    • Responses include generated answers, source documents, and impact analysis.
  5. Debug Mode:

    • Enable to access tools for inspecting the vector DB, testing chunking, embeddings, retrieval, and more.

Project Structure

codebase-qa/
├── app.py                      # Main Streamlit app orchestrator
├── ui_components.py            # UI rendering components
├── chat_handler.py             # Chat processing with intent classification
├── rag_manager.py              # RAG setup and management
├── build_rag.py                # RAG building, indexing, and dependency extraction
├── chunker_factory.py          # Semantic-aware chunking
├── config.py                   # Project configurations and auto-detection
├── git_hash_tracker.py         # File change tracking (Git or custom)
├── debug_tools.py              # Debugging utilities
├── metadata_extractor.py       # Enhanced metadata extraction
├── query_classifier.py         # Query intent classification
├── context_builder.py          # Advanced context building for queries
├── hierarchical_indexer.py     # Multi-level hierarchical indexing
└── vector_db/                  # Generated vector database (Chroma)

Generated files include:

  • code_relationships.json: Dependency mappings (a small extraction sketch follows).
  • git_tracking.json: File change tracking info.
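
A rough idea of how such a dependency mapping can be produced: the actual extraction in build_rag.py covers more languages and relationship types, while this sketch only walks Python imports with the standard library:

    import ast
    import json
    from pathlib import Path

    def extract_relationships(project_dir):
        """Map each Python file to the modules it imports."""
        relationships = {}
        for path in Path(project_dir).rglob("*.py"):
            try:
                tree = ast.parse(path.read_text())
            except SyntaxError:
                continue  # skip files that do not parse
            imports = set()
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(alias.name for alias in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    imports.add(node.module)
            relationships[str(path)] = sorted(imports)
        return relationships

    Path("code_relationships.json").write_text(
        json.dumps(extract_relationships("."), indent=2)
    )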

How It Works

  1. Input Selection:

    • User picks project directory, type, and LLM model.
  2. Index Building:

    • Scans files based on extensions (e.g., .py for Python).
    • Chunks content semantically with overlaps and metadata (e.g., dependencies, complexity).
    • Builds hierarchical indexes and stores in Chroma.
  3. Query Processing:

    • Classifies intent (e.g., overview, impact analysis); a minimal classifier sketch follows this list.
    • Rewrites query, retrieves relevant chunks, builds enhanced context.
    • Passes to local LLM for response generation.
  4. Output: Displays answer with sources, impacted files, and debug info.
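
To make step 3 concrete, here is a heavily simplified sketch of intent classification and retrieval. The keyword lists are invented for illustration, and query_classifier.py and context_builder.py use richer signals, query rewriting, and reranking:

    INTENT_KEYWORDS = {
        "impact_analysis": ("change", "impact", "affect", "break"),
        "ui_flow": ("screen", "ui", "navigation", "flow"),
        "business_logic": ("business", "rule", "validation", "logic"),
    }

    def classify_intent(query):
        """Pick an intent by keyword match, defaulting to a general overview."""
        lowered = query.lower()
        for intent, keywords in INTENT_KEYWORDS.items():
            if any(word in lowered for word in keywords):
                return intent
        return "overview"

    def retrieve_context(vector_db, query, k=8):
        """Fetch the top-k chunks and keep their sources for attribution."""
        docs = vector_db.similarity_search(query, k=k)
        sources = [doc.metadata.get("source", "unknown") for doc in docs]
        return docs, sources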

Contributing

Contributions are welcome! Please submit pull requests for bug fixes, features, or improvements. Follow standard Python best practices.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Built with Streamlit for the UI.
  • Powered by LangChain for RAG and Ollama for local LLMs.
  • Uses Chroma for vector storage.

🔧 Backlog & Fine-Tuning Tasks

These items focus on enhancing the precision, performance, and observability of context-aware retrieval in the system:

📦 Contextual Chunking & Metadata

  • Improve chunking logic to better align with semantic boundaries.
  • Fine-tune metadata extraction to enrich context awareness for downstream retrieval.

💬 Query Rewriting & Prompt Caching

  • Rewrite user queries into richer retrieval prompts (one possible shape is sketched below).
  • Integrate cache-based context to boost relevance of RAG-generated responses.
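
One possible shape for the query-rewriting step, reusing the local Ollama model already configured for chat; the prompt wording, function name, and model are illustrative, not the final design:

    from langchain_community.llms import Ollama

    REWRITE_PROMPT = (
        "Rewrite the following question about a codebase into a detailed search query. "
        "Mention likely file types, component names, and technical terms.\n\n"
        "Question: {question}"
    )

    def rewrite_query(question, model="llama3"):
        """Expand a short user question into a richer retrieval prompt."""
        llm = Ollama(model=model)
        return llm.invoke(REWRITE_PROMPT.format(question=question)).strip()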

🐞 Debugging UI Enhancements

  • Build a real-time debugging UI.
  • Display live logs for background tasks such as:
    • Chunk construction
    • Metadata sanitization
    • Index updates

🧠 Reinforcement Learning & Evaluation

  • Introduce reinforcement learning during chunking and metadata processing.
  • Evaluate effectiveness using a predefined question set.
  • Adjust logic dynamically based on retrieval quality of responses.

Note: All systems are in place; these backlog items focus on refinement and optimization. Feel free to make changes and raise PRs.
