Feature/reward task #53
base: main
Pull Request Overview
This PR implements new task types for reward-based and LLM-as-judge evaluation, refactors the task architecture to reduce code duplication, and adds utility functions and tests for the new functionality.
- Implements RewardTask (accepts reward functions that score predictions) and JudgeTask (uses an LLM to score responses, with optional ground truth)
- Refactors core evaluation functionality from ClassificationTask to BaseTask to enable code reuse across different task types
- Adds utility functions for tag extraction and improves CAPO to handle scenarios where accuracy cannot be evaluated
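To make the first and third bullets concrete, here is a minimal sketch of how a reward function and a tag-extraction helper could fit together. It is not taken from the PR: `extract_from_tag`, `length_reward`, `score_predictions`, and the `final_answer` tag are hypothetical names chosen for illustration, and the actual signatures in `promptolution` may differ.

```python
import re
from typing import Callable, List, Optional


def extract_from_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None.

    Hypothetical helper; the PR's formatting utility may work differently.
    """
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


def length_reward(prediction: str) -> float:
    """Toy reward: favor short answers, give 0.0 when no answer tag is found."""
    answer = extract_from_tag(prediction, "final_answer")
    if answer is None:
        return 0.0
    return 1.0 / (1.0 + len(answer.split()))


def score_predictions(
    predictions: List[str], reward_fn: Callable[[str], float]
) -> List[float]:
    """Apply a reward function to every prediction (assumed RewardTask behavior)."""
    return [reward_fn(p) for p in predictions]


scores = score_predictions(
    ["<final_answer>yes</final_answer>", "no tags here"], length_reward
)
```

Because the reward function only sees the prediction, no ground-truth label is needed, which is presumably why CAPO has to cope with tasks where accuracy cannot be evaluated.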
Reviewed Changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
promptolution/tasks/base_task.py | Major refactor moving evaluation logic from ClassificationTask to enable inheritance by new task types |
promptolution/tasks/reward_tasks.py | New RewardTask implementation for scoring predictions with custom reward functions |
promptolution/tasks/judge_tasks.py | New JudgeTask implementation for LLM-based evaluation with optional ground truth |
promptolution/utils/formatting.py | New utility module for tag extraction functionality |
promptolution/optimizers/capo.py | Added check_fs_accuracy parameter to handle reward tasks without ground truth |
tests/ | Comprehensive test coverage for new functionality and updated existing tests |
Tests are currently failing; the fix will follow in the next PR.
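The refactor summarized for `base_task.py` can be pictured as a template-method pattern: the shared evaluation loop lives in the base class and each task type supplies only its scoring rule. The sketch below assumes that structure; the class names, the `_score` hook, and the `evaluate` signature are hypothetical and not copied from the PR.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class SketchBaseTask(ABC):
    """Hypothetical base class that owns the shared evaluation loop."""

    def __init__(self, inputs: List[str]) -> None:
        self.inputs = inputs

    @abstractmethod
    def _score(self, x: str, prediction: str) -> float:
        """Task-specific score for a single prediction."""

    def evaluate(self, predict: Callable[[str], str]) -> float:
        """Mean score over all inputs; 0.0 for an empty task."""
        if not self.inputs:
            return 0.0
        scores = [self._score(x, predict(x)) for x in self.inputs]
        return sum(scores) / len(scores)


class SketchClassificationTask(SketchBaseTask):
    """Scores by exact match against ground-truth labels."""

    def __init__(self, inputs: List[str], labels: List[str]) -> None:
        super().__init__(inputs)
        self.labels = dict(zip(inputs, labels))

    def _score(self, x: str, prediction: str) -> float:
        return 1.0 if prediction == self.labels[x] else 0.0


class SketchRewardTask(SketchBaseTask):
    """Scores with a user-supplied reward function; no labels required."""

    def __init__(self, inputs: List[str], reward_fn: Callable[[str], float]) -> None:
        super().__init__(inputs)
        self.reward_fn = reward_fn

    def _score(self, x: str, prediction: str) -> float:
        return self.reward_fn(prediction)


clf_score = SketchClassificationTask(["a", "b"], ["A", "B"]).evaluate(str.upper)
reward_score = SketchRewardTask(["ab", "abcd"], lambda p: float(len(p))).evaluate(
    lambda x: x
)
```

With this split, adding a judge-style task would only require another `_score` implementation that calls an LLM, while batching and aggregation stay in one place.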