FSoft-AI4Code/CodeWiki

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

AI-Powered Repository Documentation Generation • Multi-Language Support • Architecture-Aware Analysis

Generate holistic, structured documentation for large-scale codebases • Cross-module interactions • Visual artifacts and diagrams

Quick Start • CLI Commands • Output Structure • Paper


Quick Start

1. Install CodeWiki

# Install from source
pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git

# Verify installation
codewiki --version

2. Configure Your Environment

codewiki config set \
  --api-key YOUR_API_KEY \
  --base-url https://api.anthropic.com \
  --main-model claude-sonnet-4 \
  --cluster-model claude-sonnet-4

3. Generate Documentation

# Navigate to your project
cd /path/to/your/project

# Generate documentation
codewiki generate

# Generate with HTML viewer for GitHub Pages
codewiki generate --github-pages --create-branch

That's it! Your documentation will be generated in ./docs/ with comprehensive repository-level analysis.


What is CodeWiki?

CodeWiki is an open-source framework for automated repository-level documentation across seven programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.

Key Innovations

| Innovation | Description | Impact |
|---|---|---|
| Hierarchical Decomposition | Dynamic programming-inspired strategy that preserves architectural context | Handles codebases of arbitrary size (86K-1.4M LOC tested) |
| Recursive Agentic System | Adaptive multi-agent processing with dynamic delegation capabilities | Maintains quality while scaling to repository-level scope |
| Multi-Modal Synthesis | Generates textual documentation, architecture diagrams, data flows, and sequence diagrams | Comprehensive understanding from multiple perspectives |

Supported Languages

🐍 Python • ☕ Java • 🟨 JavaScript • 🔷 TypeScript • ⚙️ C • 🔧 C++ • 🪟 C#


CLI Commands

Configuration Management

# Set up your API configuration
codewiki config set \
  --api-key <your-api-key> \
  --base-url <provider-url> \
  --main-model <model-name> \
  --cluster-model <model-name>

# Show current configuration
codewiki config show

# Validate your configuration
codewiki config validate

Documentation Generation

# Basic generation
codewiki generate

# Custom output directory
codewiki generate --output ./documentation

# Create git branch for documentation
codewiki generate --create-branch

# Generate HTML viewer for GitHub Pages
codewiki generate --github-pages

# Enable verbose logging
codewiki generate --verbose

# Full-featured generation
codewiki generate --create-branch --github-pages --verbose
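
When generation is driven from a script, the flags above can be assembled programmatically. The following is a minimal sketch, not part of CodeWiki: `build_generate_command` is a hypothetical helper, and it uses only the four flags listed in this section. It builds the argument list and prints it rather than running anything.

```python
import shlex


def build_generate_command(output=None, create_branch=False,
                           github_pages=False, verbose=False):
    """Return the argv list for a `codewiki generate` call (hypothetical helper)."""
    argv = ["codewiki", "generate"]
    if output is not None:
        argv += ["--output", str(output)]
    if create_branch:
        argv.append("--create-branch")
    if github_pages:
        argv.append("--github-pages")
    if verbose:
        argv.append("--verbose")
    return argv


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch has no side effects.
    print(shlex.join(build_generate_command(output="./documentation", verbose=True)))
```

The returned list can be passed to `subprocess.run(argv, check=True)` once `codewiki` is installed and configured.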

Configuration Storage

  • API keys: Securely stored in system keychain (macOS Keychain, Windows Credential Manager, Linux Secret Service)
  • Settings: ~/.codewiki/config.json
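
To inspect the stored settings from a script, the settings file can be read directly. This is a small sketch under two assumptions: the file holds a JSON object, and its key names (not documented here) are whatever `codewiki config set` wrote. `load_settings` is a hypothetical helper, and `codewiki config show` remains the supported way to view the configuration.

```python
import json
from pathlib import Path

# Path taken from the list above; the keys inside the file are not documented here.
DEFAULT_CONFIG_PATH = Path.home() / ".codewiki" / "config.json"


def load_settings(path=DEFAULT_CONFIG_PATH):
    """Return the parsed settings, or {} if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    settings = load_settings()
    # Print only the key names, so no values end up on screen.
    print(sorted(settings))
```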

Documentation Output

Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.

Textual Documentation

  • Repository overview with architecture guide
  • Module-level documentation with API references
  • Usage examples and implementation patterns
  • Cross-module interaction analysis

Visual Artifacts

  • System architecture diagrams (Mermaid)
  • Data flow visualizations
  • Dependency graphs and module relationships
  • Sequence diagrams for complex interactions

Output Structure

./docs/
├── overview.md              # Repository overview (start here!)
├── module1.md               # Module documentation
├── module2.md               # Additional modules...
├── module_tree.json         # Hierarchical module structure
├── first_module_tree.json   # Initial clustering result
├── metadata.json            # Generation metadata
└── index.html               # Interactive viewer (with --github-pages)
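
A quick way to sanity-check a run is to compare the output directory against the layout above. The sketch below assumes only the file names shown in the tree; `summarize_output` is a hypothetical helper, and it does not read the JSON files because their schema is not described here.

```python
from pathlib import Path

# File names taken from the tree above; module pages vary per repository.
EXPECTED_FILES = (
    "overview.md",
    "module_tree.json",
    "first_module_tree.json",
    "metadata.json",
)


def summarize_output(docs_dir="./docs"):
    """Report missing fixed files, module pages found, and viewer presence."""
    docs = Path(docs_dir)
    missing = [name for name in EXPECTED_FILES if not (docs / name).is_file()]
    module_pages = []
    if docs.is_dir():
        # Every Markdown file other than the overview is treated as a module page.
        module_pages = sorted(p.name for p in docs.glob("*.md") if p.name != "overview.md")
    return {
        "missing": missing,
        "module_pages": module_pages,
        "has_viewer": (docs / "index.html").is_file(),  # only with --github-pages
    }


if __name__ == "__main__":
    print(summarize_output())
```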

Experimental Results

CodeWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.

Performance by Language Category

| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
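
The Improvement column is the difference between the two scores in percentage points, not a relative gain. The sketch below recomputes it from the table; `relative_change` is an extra calculation that does not appear in the table and is included only to show the distinction.

```python
# Scores copied from the table above, in percent: (CodeWiki, DeepWiki).
SCORES = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
    "Overall Average": (68.79, 64.06),
}


def improvement_points(codewiki, deepwiki):
    """Difference in percentage points, as reported in the Improvement column."""
    return round(codewiki - deepwiki, 2)


def relative_change(codewiki, deepwiki):
    """Relative change in percent of the DeepWiki score (not in the table)."""
    return round(100 * (codewiki - deepwiki) / deepwiki, 2)


if __name__ == "__main__":
    for category, (cw, dw) in SCORES.items():
        print(f"{category}: {improvement_points(cw, dw):+.2f} points, "
              f"{relative_change(cw, dw):+.2f}% relative")
```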

Results on Representative Repositories

| Repository | Language | LOC | CodeWiki-Sonnet-4 | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |

Comprehensive results: see the paper for the complete evaluation on 21 repositories spanning all supported languages.


How It Works

Architecture Overview

CodeWiki employs a three-stage process for comprehensive documentation generation:

  1. Hierarchical Decomposition: Uses dynamic programming-inspired algorithms to partition repositories into coherent modules while preserving architectural context across multiple granularity levels.

  2. Recursive Multi-Agent Processing: Implements adaptive multi-agent processing with dynamic task delegation, allowing the system to handle complex modules at scale while maintaining quality.

  3. Multi-Modal Synthesis: Integrates textual descriptions with visual artifacts including architecture diagrams, data-flow representations, and sequence diagrams for comprehensive understanding.
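
The first two stages can be pictured with a toy example. This is an illustration of the general idea only, not CodeWiki's algorithm: `Module`, `document`, and the `budget` threshold are hypothetical names, lines of code stand in for whatever size measure the real system uses, and the agent calls are replaced by plain strings. A module that fits the budget is handled in one pass; a larger one delegates to its sub-modules and is summarized afterwards from their pages.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    loc: int = 0  # lines of code held directly in this node
    children: list["Module"] = field(default_factory=list)

    def total_loc(self) -> int:
        return self.loc + sum(child.total_loc() for child in self.children)


def document(module: Module, budget: int, depth: int = 0) -> list[str]:
    """Return one outline line per page, with children listed before parents."""
    indent = "  " * depth
    if module.total_loc() <= budget or not module.children:
        # Small enough (or indivisible): handle in a single pass.
        return [f"{indent}{module.name}: leaf page ({module.total_loc()} LOC)"]
    lines = []
    for child in module.children:
        # Too large: delegate each sub-module to a recursive call.
        lines += document(child, budget, depth + 1)
    # The parent page is written last, from its children's pages.
    lines.append(f"{indent}{module.name}: overview from {len(module.children)} sub-pages")
    return lines


if __name__ == "__main__":
    repo = Module("repo", children=[
        Module("core", children=[Module("core.a", loc=200), Module("core.b", loc=150)]),
        Module("cli", loc=50),
    ])
    print("\n".join(document(repo, budget=250)))
```

Processing children before their parent is what lets the parent page draw on already-written sub-module pages instead of raw source.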

Data Flow

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Codebase      │───▶│  Hierarchical    │───▶│  Multi-Agent    │
│   Analysis      │    │  Decomposition   │    │  Processing     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Visual        │◀───│  Multi-Modal     │◀───│  Structured     │
│   Artifacts     │    │  Synthesis       │    │  Content        │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Requirements

  • Python 3.12+
  • Node.js (for Mermaid diagram validation)
  • LLM API access (Anthropic Claude, OpenAI, etc.)
  • Git (for branch creation features)
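
The local prerequisites can be checked before installing. The sketch below is not part of CodeWiki; `check_requirements` is a hypothetical helper that only tests the Python version and whether `node` and `git` are on the `PATH`. It does not verify LLM API access.

```python
import shutil
import sys


def check_requirements():
    """Return a dict mapping each local requirement to True or False."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        "node": shutil.which("node") is not None,
        "git": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```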

Additional Resources

Documentation & Guides

Academic Resources

  • Paper - Full research paper with detailed methodology and results
  • Citation - How to cite CodeWiki in your research

Citation

If you use CodeWiki in your research, please cite:

@misc{hoang2025codewikievaluatingaisability,
      title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
      author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
      year={2025},
      eprint={2510.24428},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2510.24428},
}


License

This project is licensed under the MIT License - see the LICENSE file for details.