Labels: feature (New feature or request)
Description
Feature description
Implement an agent evaluation pipeline, similar to QuestionAnswerPipeline, that targets specific agentic domains (e.g., SWE, function calling, and reasoning).
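A pipeline along these lines might mirror QuestionAnswerPipeline's shape: take an agent callable plus a set of per-domain scorers, run each task through the agent, and aggregate metrics. The sketch below is purely illustrative — `AgentEvalPipeline`, its fields, and its method names are hypothetical, not an existing API in this repo:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentEvalPipeline:
    # Hypothetical sketch of the proposed pipeline; all names are illustrative.
    agent_fn: Callable[[str], str]  # maps a task prompt to the agent's output
    # scorers map (agent_output, reference) -> score in [0, 1]
    scorers: dict[str, Callable[[str, str], float]] = field(default_factory=dict)

    def evaluate(self, tasks: list[dict]) -> dict[str, float]:
        """Run every task through the agent and average each scorer."""
        totals = {name: 0.0 for name in self.scorers}
        for task in tasks:
            output = self.agent_fn(task["prompt"])
            for name, scorer in self.scorers.items():
                totals[name] += scorer(output, task["reference"])
        return {name: total / len(tasks) for name, total in totals.items()}
```

Domain-specific evaluators (SWE patch application, function-call argument matching, etc.) would then plug in as scorers without changing the pipeline loop:

```python
pipe = AgentEvalPipeline(
    agent_fn=lambda prompt: prompt.upper(),
    scorers={"exact_match": lambda out, ref: float(out == ref)},
)
metrics = pipe.evaluate([{"prompt": "hi", "reference": "HI"}])
```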
Motivation
Creating, benchmarking, and maintaining an agent is a difficult task. An easy-to-use evaluation pipeline would make the CI/CD process much more manageable.
Additional context
Example Usage/Output
$ uv run python examples/evaluate/code-generation/human_eval/run.py
Metrics:
humaneval_pass@1: 1.0000
humaneval_pass@5: 1.0000
humaneval_compile_rate: 1.0000
humaneval_syntax_error_rate: 0.0000
humaneval_assert_fail_rate: 0.0000
humaneval_runtime_error_rate: 0.0000
humaneval_timeout_rate: 0.0000
humaneval_tasks_solved: 1.0000
humaneval_avg_exec_time_sec: 0.0034
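The pass@k figures above are conventionally computed with the unbiased estimator from the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples per task and c the number that pass. A minimal sketch (function name is illustrative, not part of any existing API here):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per task
    c: samples that passed all tests
    k: number of samples the metric conditions on
    """
    if n - c < k:
        # Fewer failures than k draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With one sample per task that always passes (as in the run above), `pass_at_k(1, 1, 1)` gives 1.0, matching the `humaneval_pass@1: 1.0000` line.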
Status: In review