This project implements a complete, command-line-driven pipeline for converting the content of a website into a searchable knowledge base, which can then be queried through an interactive agent. It employs a Retrieval-Augmented Generation (RAG) architecture, utilizing Google's Gemini models for content processing and conversational AI, and LanceDB for efficient vector storage and retrieval.
The system operates through a sequential four-step pipeline, managed via a command-line interface. Each step is a modular component that processes the output of the previous one.
The initial step involves crawling and ingesting content from a target website.
- Process: The `BaseUrlScraper` class uses the `requests` library to perform HTTP requests and `BeautifulSoup` to parse HTML content. It starts from a given base URL and recursively follows all same-domain links to discover and download pages. To avoid fetching irrelevant content, specific URL paths can be excluded.
- Content Conversion: For each downloaded page, the raw HTML is cleaned, and the `trafilatura` library is used to convert the primary content into Markdown format. This focuses on extracting the core text while discarding boilerplate like navigation menus and footers.
- Output: Each scraped page is saved as a separate `.md` file in a designated working directory. The original source URL is preserved in the file's YAML frontmatter for traceability.
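The crawl-and-convert logic above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `BaseUrlScraper` code: `in_scope` and `fetch_as_markdown` are hypothetical names, and Markdown output from `trafilatura.extract` assumes a recent trafilatura release.

```python
from urllib.parse import urljoin, urlparse

def in_scope(base_url: str, link: str, excluded: tuple = ()) -> bool:
    """True if `link` stays on the base domain and avoids excluded URL paths."""
    target = urlparse(urljoin(base_url, link))
    if target.netloc != urlparse(base_url).netloc:
        return False
    return not any(target.path.startswith(p) for p in excluded)

def fetch_as_markdown(url: str) -> tuple:
    """Download one page, returning its extracted Markdown and outgoing links."""
    import requests                      # third-party deps imported lazily
    import trafilatura
    from bs4 import BeautifulSoup

    html = requests.get(url, timeout=10).text
    links = [a["href"] for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)]
    markdown = trafilatura.extract(html, output_format="markdown") or ""
    return markdown, links
```

A crawler loop would seed a queue with the base URL, call `fetch_as_markdown` on each page, and enqueue only links that pass `in_scope`.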
After scraping, the raw Markdown content is segmented into smaller, semantically coherent chunks suitable for vector embedding and retrieval.
- Process: The `MarkdownChunker` class sends the content of each Markdown file to a Gemini model. A carefully designed system prompt (`CHUNKING_SYSTEM_PROMP`) instructs the model to break the text into self-contained chunks, each focusing on a single topic. The model is guided to include titles, content, and associated code blocks within each chunk.
- Output: The resulting chunks are stored in `.json` files. Each JSON file corresponds to an original source page and contains a list of structured chunk objects, including `title`, `content`, and optional `code` fields.
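Because the chunker depends on the model returning well-formed JSON, a defensive parsing step is worth sketching. `load_chunks` is a hypothetical helper, not the project's `MarkdownChunker`; the sample reply only illustrates the `title`/`content`/`code` shape described above.

```python
import json

def load_chunks(model_reply: str) -> list:
    """Validate and load the model's JSON reply into chunk dicts."""
    chunks = json.loads(model_reply)
    for chunk in chunks:
        if "title" not in chunk or "content" not in chunk:
            raise ValueError(f"malformed chunk: {chunk!r}")  # `code` stays optional
    return chunks

# A reply shaped like the chunk files described above:
reply = """[
  {"title": "Installation",
   "content": "Install the package from PyPI.",
   "code": "pip install example-pkg"},
  {"title": "Configuration",
   "content": "Set the API key via an environment variable."}
]"""
chunks = load_chunks(reply)
```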
This step converts the textual chunks into numerical vector representations, enabling semantic search.
- Process: The `ChunkEmbedder` class iterates through the JSON files produced in the previous step. For each chunk, it concatenates the title and content and uses the `text-embedding-004` model via the Gemini API to generate a vector embedding.
- Data Storage: The generated vector, along with the original text, any associated code, and the source URL, is compiled into a document.
- Output: All documents are collected and used to create a LanceDB table. This produces a persistent, efficient vector database, which is overwritten on each run to ensure freshness.
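The document assembly can be sketched as follows. This is a minimal sketch with a hypothetical `build_document` helper; the injected `embed_fn` stands in for the Gemini embedding call, and the commented LanceDB lines show the general `create_table` pattern rather than the project's exact code.

```python
def build_document(chunk: dict, source_url: str, embed_fn) -> dict:
    """Assemble one row for the vector table: the embedding plus the
    fields needed at answer time (text, code, source URL)."""
    text = f"{chunk['title']}\n{chunk['content']}"
    return {
        "vector": embed_fn(text),          # e.g. a call to text-embedding-004
        "text": text,
        "code": chunk.get("code", ""),
        "url": source_url,
    }

# Writing the table (requires `lancedb`; mode="overwrite" rebuilds it each run):
#   db = lancedb.connect("kb.lancedb")
#   db.create_table("docs", data=documents, mode="overwrite")
```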
The final component is an interactive command-line agent that allows users to query the knowledge base.
- Process: The `RagAgent` class initializes an agent using the `pydantic-ai` library and a Gemini model. It equips the agent with a single tool: `query_knowledge_base`. When a user asks a question, the agent first uses this tool.
- Retrieval: The tool takes the user's query, generates a vector embedding for it, and performs a similarity search against the LanceDB database to find the top 5 most relevant chunks.
- Generation: The retrieved chunks are compiled into a context block, which is then passed to the Gemini model along with the original question. The model synthesizes this information to generate a comprehensive answer.
- Output: The agent streams the final answer to the console. The response includes the synthesized information, relevant code examples, and the source URLs from which the information was retrieved.
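Assembling the retrieved rows into a context block might look like this. `build_context` is a hypothetical helper, assuming rows carry the `text`, `code`, and `url` fields stored at embedding time.

```python
def build_context(rows: list, limit: int = 5) -> str:
    """Format the top retrieved chunks into a context block for the model,
    keeping each chunk's code and source URL alongside its text."""
    parts = []
    for row in rows[:limit]:
        part = row["text"]
        if row.get("code"):
            part += "\nCode:\n" + row["code"]
        part += "\nSource: " + row["url"]
        parts.append(part)
    return "\n\n".join(parts)
```

The model then receives this block together with the user's question, which is also how the agent can cite source URLs in its answer.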
The entire pipeline is orchestrated via `main.py`.
Scrapes a website and stores the content as Markdown files.
```
python main.py scrape <workdir> <url> [--exclude /path1 /path2 ...]
```

- `workdir`: The directory to store the output `.md` files.
- `url`: The base URL to begin scraping from.
- `--exclude`: (Optional) A space-separated list of URL paths to exclude.
Converts the Markdown files into structured JSON chunk files.
```
python main.py chunk <workdir>
```

- `workdir`: The directory containing the `.md` files to process.
Creates a LanceDB vector database from the JSON chunk files.
```
python main.py embed <dbfile> <pattern> --collection <name>
```

- `dbfile`: The path to the LanceDB database directory.
- `pattern`: A glob pattern to find the input `.json` chunk files (e.g., `"./data/*.json"`).
- `--collection`: The name of the table to create within the database.
Starts the interactive chat agent.
```
python main.py agent <dbfile> --collection <name>
```

- `dbfile`: The path to the LanceDB database.
- `--collection`: The name of the collection to query.
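Putting the four steps together, a typical end-to-end run might look like this (the URL, directories, and collection name are placeholders):

```shell
python main.py scrape ./data https://docs.example.com --exclude /blog /changelog
python main.py chunk ./data
python main.py embed ./kb.lancedb "./data/*.json" --collection docs
python main.py agent ./kb.lancedb --collection docs
```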