
Feature/reward task #53


Open · wants to merge 14 commits into main

Conversation

@finitearth (Owner) commented Jul 18, 2025

  • Implements new tasks: RewardTask (accepts a reward function that maps a prediction to a score) and JudgeTask (uses an LLM to score responses; optionally also accepts ground-truth labels, allowing for "fuzzy matches").
  • Core functionality of the classification task has been moved to the base task to prevent code duplication across other tasks.
  • CAPO now accepts the input parameter "check_fs_accuracy" (default True): for reward tasks the accuracy cannot be evaluated, so the prediction of the downstream_llm is used as the few-shot target instead.
  • CAPO also accepts "create_fs_reasoning" (default True): if set to False, only the input-output pairs from df_few_shots are used.
  • Introduces a tag-extraction function to centralize repeated code for extractions like "<final_answer>5</final_answer>" (see the sketch after this list).
  • Boosted test coverage.
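
A minimal sketch of what the centralized tag extraction could look like; the function name extract_from_tag and its exact signature are illustrative assumptions, not necessarily the API added in promptolution/utils/formatting.py:

```python
import re
from typing import Optional

def extract_from_tag(text: str, tag: str = "final_answer") -> Optional[str]:
    """Return the content of the first <tag>...</tag> block in text, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Pulls "5" out of a response like the one mentioned above.
print(extract_from_tag("The result is <final_answer>5</final_answer>."))  # -> 5
```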


Coverage

Tests: 84 | Skipped: 0 💤 | Failures: 0 ❌ | Errors: 0 🔥 | Time: 0.960s ⏱️

@Copilot (Copilot AI, Contributor) left a comment


Pull Request Overview

This PR implements new task types for reward-based and LLM-as-judge evaluation, refactors the task architecture to reduce code duplication, and introduces several utility functions to improve functionality and test coverage.

  • Implements RewardTask (accepts a reward function for scoring predictions) and JudgeTask (uses an LLM to score responses, with optional ground truth); see the usage sketch after this list
  • Refactors core evaluation functionality from ClassificationTask into BaseTask to enable code reuse across different task types
  • Adds a utility function for tag extraction and improves CAPO to handle scenarios where accuracy cannot be evaluated
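
A minimal usage sketch of the two new task types. The class names and module paths come from this PR's file list, but the constructors are not shown in the diff, so the argument names below (reward_function, judge_llm, etc.) are assumptions for illustration:

```python
import pandas as pd

# Module paths per this PR's file list; argument names below are assumed.
from promptolution.tasks.reward_tasks import RewardTask
from promptolution.tasks.judge_tasks import JudgeTask

df = pd.DataFrame({"input": ["Write a haiku about spring."]})

# RewardTask: a reward function maps a prediction to a numeric score.
def brevity_reward(prediction: str) -> float:
    return 1.0 / (1.0 + len(prediction.split()))  # toy reward favoring short outputs

reward_task = RewardTask(df, reward_function=brevity_reward)

# JudgeTask: an LLM scores each response; ground-truth labels are optional and
# enable "fuzzy matches" instead of exact comparison.
judge_llm = ...  # placeholder for whatever LLM wrapper promptolution expects
judge_task = JudgeTask(df, judge_llm=judge_llm)
```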

Reviewed Changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.

File | Description
promptolution/tasks/base_task.py | Major refactor moving evaluation logic out of ClassificationTask to enable inheritance by new task types
promptolution/tasks/reward_tasks.py | New RewardTask implementation for scoring predictions with custom reward functions
promptolution/tasks/judge_tasks.py | New JudgeTask implementation for LLM-based evaluation with optional ground truth
promptolution/utils/formatting.py | New utility module for tag-extraction functionality
promptolution/optimizers/capo.py | Adds the check_fs_accuracy parameter to handle reward tasks without ground truth (see the sketch after this table)
tests/ | Comprehensive test coverage for new functionality and updated existing tests
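
A sketch of how the new flags might be wired up when optimizing a reward task. check_fs_accuracy and create_fs_reasoning are the parameters introduced in this PR; the remaining constructor arguments and their names are assumptions for illustration:

```python
from promptolution.optimizers.capo import CAPO  # import path per this PR's file list

# For a RewardTask there are no ground-truth labels, so few-shot accuracy cannot be
# checked; the downstream LLM's own prediction then serves as the few-shot target.
downstream_llm = ...  # placeholder for the downstream LLM wrapper (assumed argument name)
optimizer = CAPO(
    task=reward_task,               # e.g. a RewardTask instance (assumed wiring)
    downstream_llm=downstream_llm,
    check_fs_accuracy=False,        # new in this PR, default True; disabled here for a reward task
    create_fs_reasoning=True,       # new in this PR, default True; False uses raw input-output pairs from df_few_shots
)
```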

@finitearth marked this pull request as ready for review July 21, 2025 14:18
@finitearth requested a review from mo374z as a code owner July 21, 2025 14:18
@finitearth (Owner, Author) commented:

Tests are red right now; the fix is in the next PR.

@finitearth requested a review from timo282 July 22, 2025 13:58