
feat: eval todo list #829

@mhordynski

Description


Feature description

Implement an agent evaluation pipeline, similar to QuestionAnswerPipeline, that targets specific agentic domains such as SWE tasks, function calling, and reasoning.
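
A minimal sketch of what such a pipeline could look like, assuming an interface analogous to QuestionAnswerPipeline; every name below (AgentEvalPipeline, EvalTask, the metric signature) is hypothetical, not an existing API:

```python
# Hypothetical sketch only -- AgentEvalPipeline and EvalTask are
# illustrative names, not part of any existing interface.
from dataclasses import dataclass


@dataclass
class EvalTask:
    """One evaluation case: a prompt and a reference for scoring."""
    prompt: str
    reference: str


class AgentEvalPipeline:
    """Runs an agent over a task set and aggregates metrics."""

    def __init__(self, agent, metrics):
        self.agent = agent      # any callable: prompt -> completion
        self.metrics = metrics  # name -> fn(completion, reference) -> float

    def evaluate(self, tasks: list[EvalTask]) -> dict[str, float]:
        totals = {name: 0.0 for name in self.metrics}
        for task in tasks:
            completion = self.agent(task.prompt)
            for name, metric in self.metrics.items():
                totals[name] += metric(completion, task.reference)
        # Average each metric over the task set.
        return {name: total / len(tasks) for name, total in totals.items()}
```

Domain-specific benchmarks (e.g. HumanEval for code generation, as in the example below) would then plug in their own task sets and metric functions.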

Motivation

Creating, benchmarking, and maintaining an agent is a difficult task, so an easy-to-use evaluation pipeline would make the CI/CD process much more manageable.

Additional context

Example Usage/Output

$ uv run python examples/evaluate/code-generation/human_eval/run.py 

Metrics:
  humaneval_pass@1: 1.0000
  humaneval_pass@5: 1.0000
  humaneval_compile_rate: 1.0000
  humaneval_syntax_error_rate: 0.0000
  humaneval_assert_fail_rate: 0.0000
  humaneval_runtime_error_rate: 0.0000
  humaneval_timeout_rate: 0.0000
  humaneval_tasks_solved: 1.0000
  humaneval_avg_exec_time_sec: 0.0034
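
For reference, the pass@k values above are conventionally computed with the unbiased estimator from the original HumanEval paper (Chen et al., 2021): 1 - C(n-c, k)/C(n, k) for c correct completions out of n samples. A standalone sketch of that estimator (the pipeline's actual implementation may differ):

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    for c correct completions out of n generated samples."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw
        # contains at least one correct completion.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

When every sample passes (c == n), the estimator returns 1.0 for all k, which matches the pass@1 and pass@5 values shown above.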
