
[Feature Request] Add separate leaderboard tracks for "Thinking" and "Non-Thinking" evaluations #157

@Alice39s

Description

🥰 Feature Description

This request builds on the potential bug regarding model aliases (Issue #156). While the 'thinking' or 'reasoning' features offered by providers such as OpenAI and Anthropic provide significant performance boosts, they make direct comparisons between models difficult.

🧐 Proposed Solution

Therefore, I propose creating two distinct tracks on the leaderboard (a configuration sketch follows the list):

  1. Base Capability Track (Non-Thinking):
    • All models run in their standard API mode, without any reasoning enhancements.
    • Focuses on fundamental performance, measuring critical metrics like TTFT (Time to First Token) and TPS (Tokens per Second).
  2. Peak Performance Track (Thinking-Enabled):
    • Explicitly enables reasoning features for all supported models (e.g., gpt-5-thinking, Claude's extended reasoning).
    • Showcases the upper limit of each model's capability in solving complex, multi-step problems.
    • Could report metrics like average reasoning tokens used.
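To make the split concrete, here is a minimal sketch of how an evaluation harness could toggle reasoning per track. Everything in it is illustrative: `TRACKS`, the function names, and the model IDs are hypothetical, and the provider switches (Anthropic's `thinking` parameter, OpenAI's `reasoning_effort`) reflect the public APIs as I understand them, so they should be verified against each SDK before use.

```python
import time

import anthropic
import openai

# Hypothetical track definitions for the two proposed leaderboards.
TRACKS = {
    "base": {"thinking": False},  # Base Capability Track: standard API mode
    "peak": {"thinking": True},   # Peak Performance Track: reasoning enabled
}


def run_anthropic(prompt: str, thinking: bool) -> dict:
    """One evaluation call against an Anthropic model (model ID is a placeholder)."""
    client = anthropic.Anthropic()
    kwargs = {}
    if thinking:
        # Extended thinking: max_tokens must exceed the thinking budget.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8_000}
    start = time.monotonic()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder
        max_tokens=16_000,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    elapsed = time.monotonic() - start
    out = resp.usage.output_tokens  # includes thinking tokens when enabled
    # Wall-clock TPS only; TTFT would require a streaming call, omitted here.
    return {"output_tokens": out, "tps": out / elapsed}


def run_openai(prompt: str, thinking: bool) -> dict:
    """One evaluation call against an OpenAI model (model ID is a placeholder)."""
    client = openai.OpenAI()
    # Note: reasoning-first models may not support fully disabling reasoning,
    # so the Base track may need a per-model policy rather than an off switch.
    kwargs = {"reasoning_effort": "high"} if thinking else {}
    start = time.monotonic()
    resp = client.chat.completions.create(
        model="gpt-5",  # placeholder
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    elapsed = time.monotonic() - start
    usage = resp.usage
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) if details else 0
    return {
        "output_tokens": usage.completion_tokens,
        "reasoning_tokens": reasoning,  # feeds the "avg reasoning tokens" metric
        "tps": usage.completion_tokens / elapsed,
    }


if __name__ == "__main__":
    for track, cfg in TRACKS.items():
        print(track, run_openai("What is 17 * 24?", cfg["thinking"]))
```

One design question this sketch surfaces: since some reasoning-first models cannot fully disable thinking, the Base track may need to record the lowest available reasoning setting per model instead of assuming a strict non-thinking mode.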
