
feat: eval todo list #829

@mhordynski

Description


Feature description

Implement an agent evaluation pipeline, similar to QuestionAnswerPipeline, that targets specific agentic domains such as SWE tasks, function calling, and reasoning.
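
A minimal sketch of what such a pipeline could look like, assuming an interface analogous to QuestionAnswerPipeline; every name below (AgentEvalPipeline, EvalTask, the metric signature) is hypothetical, not an existing API:

```python
# Hypothetical sketch only -- AgentEvalPipeline and EvalTask are
# illustrative names, not part of any existing interface.
from dataclasses import dataclass


@dataclass
class EvalTask:
    """One evaluation case: a prompt and a reference for scoring."""
    prompt: str
    reference: str


class AgentEvalPipeline:
    """Runs an agent over a task set and aggregates metrics."""

    def __init__(self, agent, metrics):
        self.agent = agent      # any callable: prompt -> completion
        self.metrics = metrics  # name -> fn(completion, reference) -> float

    def evaluate(self, tasks: list[EvalTask]) -> dict[str, float]:
        totals = {name: 0.0 for name in self.metrics}
        for task in tasks:
            completion = self.agent(task.prompt)
            for name, metric in self.metrics.items():
                totals[name] += metric(completion, task.reference)
        # Average each metric over the task set.
        return {name: total / len(tasks) for name, total in totals.items()}
```

Domain-specific benchmarks (e.g. HumanEval for code generation, as in the example below) would then plug in their own task sets and metric functions.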

Motivation

Creating, benchmarking, and maintaining an agent is a difficult task, so an easy-to-use evaluation pipeline would make the CI/CD process much more manageable.

Additional context

Example Usage/Output

$ uv run python examples/evaluate/code-generation/human_eval/run.py 

Metrics:
  humaneval_pass@1: 1.0000
  humaneval_pass@5: 1.0000
  humaneval_compile_rate: 1.0000
  humaneval_syntax_error_rate: 0.0000
  humaneval_assert_fail_rate: 0.0000
  humaneval_runtime_error_rate: 0.0000
  humaneval_timeout_rate: 0.0000
  humaneval_tasks_solved: 1.0000
  humaneval_avg_exec_time_sec: 0.0034
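
For reference, the pass@k values above are conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): 1 - C(n-c, k)/C(n, k) for c correct completions out of n samples. A standalone sketch of that estimator (the pipeline's actual implementation may differ):

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    for c correct completions out of n generated samples."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw
        # contains at least one correct completion.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

When every sample passes (c == n), the estimator returns 1.0 for all k, which matches the pass@1 and pass@5 values shown above.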
