This request builds on Issue #156 (the potential model-alias bug). While the 'thinking' or 'reasoning' features offered by providers such as OpenAI and Anthropic deliver significant performance gains, they make direct comparisons between models difficult.
🧐 Proposed Solution
I propose creating two distinct tracks on the leaderboard:
**Base Capability Track (Non-Thinking)**
- All models run in their standard API mode, without any reasoning enhancements.
- Focuses on fundamental performance, measuring critical metrics like TTFT (Time to First Token) and TPS (Tokens per Second); see the measurement sketch after this list.
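For the base track, both TTFT and TPS can be derived from a single streamed request. Below is a minimal sketch, assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the model id and prompt are placeholders, and chunk counts are only a rough proxy for token counts.

```python
import time

from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def measure_ttft_tps(model: str, prompt: str) -> tuple[float, float]:
    """Stream one completion and derive TTFT and TPS from wall-clock timestamps."""
    start = time.perf_counter()
    first_token_at = None
    chunk_count = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g., the final one) may carry no content delta.
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunk_count += 1
    end = time.perf_counter()

    ttft = (first_token_at or end) - start
    # Content chunks only approximate tokens; a tokenizer gives exact counts.
    tps = chunk_count / (end - first_token_at) if first_token_at and end > first_token_at else 0.0
    return ttft, tps


print(measure_ttft_tps("gpt-4o-mini", "List three prime numbers."))  # placeholder model id
```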
**Peak Performance Track (Thinking-Enabled)**
- Explicitly enables reasoning features for all supported models (e.g., gpt-5-thinking, Claude's extended reasoning).
- Showcases the upper limit of each model's capability on complex, multi-step problems.
- Could report metrics like the average number of reasoning tokens used; a sketch of how that count could be captured follows.
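To illustrate how the reasoning-token metric could be collected, here is a minimal sketch assuming the OpenAI Chat Completions API, whose `reasoning_effort` parameter enables thinking on reasoning-capable models and whose usage accounting exposes `completion_tokens_details.reasoning_tokens` (Anthropic's extended thinking is configured via a `thinking` parameter instead); the model id and prompt are placeholders.

```python
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Run one benchmark item with reasoning enabled and capture the
# reasoning-token count that the track would average across items.
response = client.chat.completions.create(
    model="o3-mini",  # placeholder: any reasoning-capable model id
    reasoning_effort="medium",  # requests an internal reasoning budget
    messages=[{"role": "user", "content": "Solve this multi-step problem: ..."}],
)

details = response.usage.completion_tokens_details
# completion_tokens includes reasoning tokens, so the visible answer is the difference.
print(f"reasoning tokens used: {details.reasoning_tokens}")
print(f"visible output tokens: {response.usage.completion_tokens - details.reasoning_tokens}")
```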