ai-benchmarking

Here are 6 public repositories matching this topic...

Cre4T3Tiv3 / ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Aug 8, 2025
Python
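
The ai-agents-reality-check description above mentions 95% confidence intervals and Cohen's h. As a rough illustration of what those two statistics measure when comparing success rates, here is a minimal sketch. It is not taken from the repository: the function names `cohens_h` and `wilson_ci` are hypothetical, and the Wilson score interval is an assumed choice of interval, since the description does not say which method the project uses.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions.

    Uses the arcsine transform, h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)).
    """
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumed interval method for illustration only.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical success counts for two systems, not real benchmark data.
    a_success, b_success, n = 80, 50, 100
    print("h =", cohens_h(a_success / n, b_success / n))
    print("CI A =", wilson_ci(a_success, n))
    print("CI B =", wilson_ci(b_success, n))
```

By Cohen's conventional thresholds, |h| near 0.2 is read as a small effect, 0.5 as medium, and 0.8 as large.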

jabberjabberjabber / Context-Tester

Generates data for benchmarking the effect of context window size on LLM creativity in story writing

tokenizer llm koboldcpp ai-benchmarking

Updated Sep 22, 2025
Python

ImBIOS / ide-ai-benchmark

Comprehensive multi-IDE AI model benchmarking framework supporting Cursor, Windsurf, VSCode, and other IDEs with automated testing and performance comparison capabilities

testing performance vscode openai cursor copilot claude windsurf ai-benchmarking ide-automation

Updated Jul 14, 2025
Python

xerk-dot / medical-coding-ai

A comprehensive benchmarking platform for CPT, ICD-10, and HCPCS coding questions. Identifies the most reliable models for healthcare applications. Evaluates multiple AI models on medical coding expertise through iterative consensus-building.

healthcare icd-10 multi-agent-systems medical-ai medical-coding consensus-algorithms openrouter cpt-codes ai-benchmarking