[Groundedness] handle edge cases by copy #43923
base: main
Conversation
Pull Request Overview
This PR refactors the GroundednessEvaluator to handle edge cases by introducing separate prompty flows for evaluations with and without query parameters. The changes aim to improve the evaluator's handling of different input scenarios.
Key Changes:
- Introduces dual flow initialization (_flow_with_query and _flow_no_query) to support different prompty templates based on whether a query is provided (sketched below)
- Adds helper methods _validate_context and _is_single_entry for improved input validation
- Implements edge case handling for scenarios with invalid context and single-entry inputs
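For illustration, the dual-flow initialization might look roughly like the sketch below. The prompty file names, the helper function, and the import path are assumptions; only the attribute names (_flow_with_query, _flow_no_query) and the shape of the AsyncPrompty.load call come from the diff.

    import os

    # Import path is an assumption; the diff only shows the call site.
    from promptflow.core import AsyncPrompty

    def load_groundedness_flows(prompty_model_config):
        """Build both flows once; evaluation later picks one per input (sketch)."""
        base_dir = os.path.dirname(__file__)
        # Hypothetical template file names -- one prompt expects a query, one does not.
        flow_with_query = AsyncPrompty.load(
            source=os.path.join(base_dir, "groundedness_with_query.prompty"),
            model=prompty_model_config,
        )
        flow_no_query = AsyncPrompty.load(
            source=os.path.join(base_dir, "groundedness_no_query.prompty"),
            model=prompty_model_config,
        )
        return flow_with_query, flow_no_query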
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py:364
- Significant code duplication. The entire _do_eval_with_flow method (lines 291-364) is a near-exact copy of the parent class's _do_eval method from _base_prompty_eval.py (lines 118-191). This violates the DRY principle and creates a maintenance burden. Consider refactoring the parent class to accept an optional flow parameter, or use composition to avoid duplicating ~75 lines of code.
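One shape that refactor could take, sketched under the assumption that the parent class otherwise keeps its current behavior (names other than _do_eval, _flow, _flow_with_query, and _flow_no_query are hypothetical); the copied method itself follows in the diff excerpt below:

    # In PromptyEvaluatorBase (sketch): let subclasses inject the flow to run.
    async def _do_eval(self, eval_input: dict, flow=None) -> dict:
        flow = flow if flow is not None else self._flow
        prompty_output_dict = await flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
        ...  # rest of the existing parsing logic, unchanged

    # In GroundednessEvaluator (sketch): pick the flow, then delegate.
    async def _do_eval(self, eval_input: dict) -> dict:
        flow = self._flow_with_query if "query" in eval_input else self._flow_no_query
        return await super()._do_eval(eval_input, flow=flow)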
            and other fields depending on the child class.
        :type eval_input: Dict
        :return: The evaluation result.
        :rtype: Dict
        """
        if "query" not in eval_input and "response" not in eval_input:
            raise EvaluationException(
                message="Only text conversation inputs are supported.",
                internal_message="Only text conversation inputs are supported.",
                blame=ErrorBlame.USER_ERROR,
                category=ErrorCategory.INVALID_VALUE,
                target=ErrorTarget.CONVERSATION,
            )
        # Call the prompty flow to get the evaluation result.
        prompty_output_dict = await flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)
        score = math.nan
        if prompty_output_dict:
            llm_output = prompty_output_dict.get("llm_output", "")
            input_token_count = prompty_output_dict.get("input_token_count", 0)
            output_token_count = prompty_output_dict.get("output_token_count", 0)
            total_token_count = prompty_output_dict.get("total_token_count", 0)
            finish_reason = prompty_output_dict.get("finish_reason", "")
            model_id = prompty_output_dict.get("model_id", "")
            sample_input = prompty_output_dict.get("sample_input", "")
            sample_output = prompty_output_dict.get("sample_output", "")
            # Parse out score and reason from evaluators known to possess them.
            if self._result_key in PROMPT_BASED_REASON_EVALUATORS:
                score, reason = parse_quality_evaluator_reason_score(llm_output)
                binary_result = self._get_binary_result(score)
                return {
                    self._result_key: float(score),
                    f"gpt_{self._result_key}": float(score),
                    f"{self._result_key}_reason": reason,
                    f"{self._result_key}_result": binary_result,
                    f"{self._result_key}_threshold": self._threshold,
                    f"{self._result_key}_prompt_tokens": input_token_count,
                    f"{self._result_key}_completion_tokens": output_token_count,
                    f"{self._result_key}_total_tokens": total_token_count,
                    f"{self._result_key}_finish_reason": finish_reason,
                    f"{self._result_key}_model": model_id,
                    f"{self._result_key}_sample_input": sample_input,
                    f"{self._result_key}_sample_output": sample_output,
                }
            match = re.search(r"\d", llm_output)
            if match:
                score = float(match.group())
            binary_result = self._get_binary_result(score)
            return {
                self._result_key: float(score),
                f"gpt_{self._result_key}": float(score),
                f"{self._result_key}_result": binary_result,
                f"{self._result_key}_threshold": self._threshold,
                f"{self._result_key}_prompt_tokens": input_token_count,
                f"{self._result_key}_completion_tokens": output_token_count,
                f"{self._result_key}_total_tokens": total_token_count,
                f"{self._result_key}_finish_reason": finish_reason,
                f"{self._result_key}_model": model_id,
                f"{self._result_key}_sample_input": sample_input,
                f"{self._result_key}_sample_output": sample_output,
            }
        binary_result = self._get_binary_result(score)
        return {
            self._result_key: float(score),
            f"gpt_{self._result_key}": float(score),
            f"{self._result_key}_result": binary_result,
            f"{self._result_key}_threshold": self._threshold,
        }

    async def _real_call(self, **kwargs):
        """The asynchronous call where real end-to-end evaluation logic is performed.

        :keyword kwargs: The inputs to evaluate.
            whatever inputs are needed for the _flow method, including context
            and other fields depending on the child class.
        :type eval_input: Dict
        :return: The evaluation result.
Copilot AI (Nov 10, 2025):
Incomplete comment. The NOTE on line 293 states "This is copy from parent" but doesn't explain why the copy is necessary or reference a tracking issue for refactoring. Consider expanding this comment to explain the rationale and possibly link to a future work item for eliminating the duplication.
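For example, the NOTE could be expanded along these lines (the tracking reference is a placeholder to be filled in, not an existing work item):

    # NOTE: This method is a copy of PromptyEvaluatorBase._do_eval. The copy is
    # needed because the base method reads self._flow directly and provides no
    # way to inject a per-call flow. Remove once the base class accepts an
    # optional flow parameter. Tracking: <work item link TBD>.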
            UserAgentSingleton().value,
        )
        self._flow = AsyncPrompty.load(source=self._prompty_file, model=prompty_model_config)
        flow = AsyncPrompty.load(
            source=prompty_path,
            model=prompty_model_config,
            is_reasoning_model=self._is_reasoning_model,
            **kwargs,
Copilot AI (Nov 10, 2025):
The _load_flow method creates and assigns to self._prompty_file and self._flow (lines 228, 234), but immediately sets them to None in the constructor (lines 122-123). These assignments are dead code and should be removed. The method should only construct and return the flow variable (lines 235-240).
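In that spirit, _load_flow could be a pure builder that leaves instance state alone; a sketch, with the model-config construction elided since only part of it appears in the diff:

    def _load_flow(self, prompty_path: str, **kwargs):
        # Build the model config as the surrounding diff does (elided here).
        prompty_model_config = ...
        # Construct and return the flow without assigning to self._flow or
        # self._prompty_file, so the constructor stays the single owner of state.
        return AsyncPrompty.load(
            source=prompty_path,
            model=prompty_model_config,
            is_reasoning_model=self._is_reasoning_model,
            **kwargs,
        )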
        :param eval_input: The input to the evaluator. Expected to contain
            whatever inputs are needed for the _flow method, including context
            and other fields depending on the child class.
        :type eval_input: Dict
Copilot AI (Nov 10, 2025):
Incorrect docstring description. The method performs groundedness evaluation, not relevance evaluation. Change "Do a relevance evaluation." to "Do a groundedness evaluation."