
Conversation

@ahibrahimm
Contributor

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings November 10, 2025 21:33
@ahibrahimm ahibrahimm requested a review from a team as a code owner November 10, 2025 21:33
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Nov 10, 2025
Copilot finished reviewing on behalf of ahibrahimm November 10, 2025 21:36

Copilot AI left a comment


Pull Request Overview

This PR refactors the GroundednessEvaluator to handle edge cases by introducing separate prompty flows for evaluations with and without query parameters. The changes aim to improve the evaluator's handling of different input scenarios.

Key Changes:

  • Introduces dual flow initialization (_flow_with_query and _flow_no_query) to support different prompty templates based on whether a query is provided
  • Adds helper methods _validate_context and _is_single_entry for improved input validation
  • Implements edge case handling for scenarios with invalid context and single-entry inputs
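
The dual-flow routing described above can be sketched as follows. This is a minimal illustrative sketch, not the PR's actual implementation: the prompty file names, the `load_flow` callable (standing in for `AsyncPrompty.load`), and the routing condition are assumptions based on the review summary.

```python
# Hypothetical sketch of dual prompty-flow selection based on whether a
# query is present in the evaluation input. File names are made up.
class GroundednessEvaluatorSketch:
    _PROMPTY_WITH_QUERY = "groundedness_with_query.prompty"
    _PROMPTY_NO_QUERY = "groundedness_without_query.prompty"

    def __init__(self, load_flow):
        # load_flow stands in for AsyncPrompty.load(source=..., model=...).
        self._flow_with_query = load_flow(self._PROMPTY_WITH_QUERY)
        self._flow_no_query = load_flow(self._PROMPTY_NO_QUERY)

    def _select_flow(self, eval_input: dict):
        # Route to the template that matches the provided inputs; an empty
        # or absent query falls back to the no-query template.
        if eval_input.get("query"):
            return self._flow_with_query
        return self._flow_no_query
```

Keeping the routing in one small helper like this makes the edge-case behavior (query present vs. absent) testable in isolation.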
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py:364

  • Significant code duplication. The entire _do_eval_with_flow method (lines 291-364) is a near-exact copy of the parent class's _do_eval method from _base_prompty_eval.py (lines 118-191). This violates the DRY principle and creates a maintenance burden. Consider refactoring the parent class to accept an optional flow parameter, or use composition to avoid duplicating ~75 lines of code.
        and other fields depending on the child class.
        :type eval_input: Dict
        :return: The evaluation result.
        :rtype: Dict
        """
        if "query" not in eval_input and "response" not in eval_input:
            raise EvaluationException(
                message="Only text conversation inputs are supported.",
                internal_message="Only text conversation inputs are supported.",
                blame=ErrorBlame.USER_ERROR,
                category=ErrorCategory.INVALID_VALUE,
                target=ErrorTarget.CONVERSATION,
            )
        # Call the prompty flow to get the evaluation result.
        prompty_output_dict = await flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)

        score = math.nan
        if prompty_output_dict:
            llm_output = prompty_output_dict.get("llm_output", "")
            input_token_count = prompty_output_dict.get("input_token_count", 0)
            output_token_count = prompty_output_dict.get("output_token_count", 0)
            total_token_count = prompty_output_dict.get("total_token_count", 0)
            finish_reason = prompty_output_dict.get("finish_reason", "")
            model_id = prompty_output_dict.get("model_id", "")
            sample_input = prompty_output_dict.get("sample_input", "")
            sample_output = prompty_output_dict.get("sample_output", "")
            # Parse out score and reason from evaluators known to possess them.
            if self._result_key in PROMPT_BASED_REASON_EVALUATORS:
                score, reason = parse_quality_evaluator_reason_score(llm_output)
                binary_result = self._get_binary_result(score)
                return {
                    self._result_key: float(score),
                    f"gpt_{self._result_key}": float(score),
                    f"{self._result_key}_reason": reason,
                    f"{self._result_key}_result": binary_result,
                    f"{self._result_key}_threshold": self._threshold,
                    f"{self._result_key}_prompt_tokens": input_token_count,
                    f"{self._result_key}_completion_tokens": output_token_count,
                    f"{self._result_key}_total_tokens": total_token_count,
                    f"{self._result_key}_finish_reason": finish_reason,
                    f"{self._result_key}_model": model_id,
                    f"{self._result_key}_sample_input": sample_input,
                    f"{self._result_key}_sample_output": sample_output,
                }
match = re.search(r"\d", llm_output)
if match:
    score = float(match.group())
# Bind binary_result unconditionally so the return below cannot
# reference an unbound local when no digit is found.
binary_result = self._get_binary_result(score)
            return {
                self._result_key: float(score),
                f"gpt_{self._result_key}": float(score),
                f"{self._result_key}_result": binary_result,
                f"{self._result_key}_threshold": self._threshold,
                f"{self._result_key}_prompt_tokens": input_token_count,
                f"{self._result_key}_completion_tokens": output_token_count,
                f"{self._result_key}_total_tokens": total_token_count,
                f"{self._result_key}_finish_reason": finish_reason,
                f"{self._result_key}_model": model_id,
                f"{self._result_key}_sample_input": sample_input,
                f"{self._result_key}_sample_output": sample_output,
            }

        binary_result = self._get_binary_result(score)
        return {
            self._result_key: float(score),
            f"gpt_{self._result_key}": float(score),
            f"{self._result_key}_result": binary_result,
            f"{self._result_key}_threshold": self._threshold,
        }

    async def _real_call(self, **kwargs):
        """The asynchronous call where real end-to-end evaluation logic is performed.

        :keyword kwargs: The inputs to evaluate.

whatever inputs are needed for the _flow method, including context
and other fields depending on the child class.
:type eval_input: Dict
:return: The evaluation result.

Copilot AI Nov 10, 2025


Incomplete comment. The NOTE on line 293 states "This is copy from parent" but doesn't explain why the copy is necessary or reference a tracking issue for refactoring. Consider expanding this comment to explain the rationale and possibly link to a future work item for eliminating the duplication.
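
The reviewer's suggested alternative to the copied method is to let the parent accept an optional flow override. A minimal sketch of that refactor, assuming simplified class and method shapes (`_do_eval`, `_flow` follow the review; the bodies are illustrative, not the SDK's real code):

```python
import math

# Hypothetical base class: _do_eval takes an optional flow override so
# subclasses can swap templates without duplicating the method body.
class PromptyEvaluatorBaseSketch:
    def __init__(self, flow):
        self._flow = flow

    async def _do_eval(self, eval_input: dict, flow=None) -> dict:
        # Fall back to the default flow when no override is supplied.
        flow = flow or self._flow
        output = await flow(**eval_input)
        return {"score": output.get("score", math.nan)}

class GroundednessSketch(PromptyEvaluatorBaseSketch):
    def __init__(self, flow_with_query, flow_no_query):
        super().__init__(flow_with_query)
        self._flow_no_query = flow_no_query

    async def _do_eval(self, eval_input: dict, flow=None) -> dict:
        # Choose the flow once, then reuse the parent implementation
        # instead of copying its ~75 lines.
        chosen = self._flow if "query" in eval_input else self._flow_no_query
        return await super()._do_eval(eval_input, flow=chosen)
```

With this shape, the parent keeps sole ownership of the scoring and token-accounting logic, and the subclass only decides which flow to run.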

Comment on lines 219 to +234
UserAgentSingleton().value,
)
self._flow = AsyncPrompty.load(source=self._prompty_file, model=prompty_model_config)
flow = AsyncPrompty.load(
source=prompty_path,
model=prompty_model_config,
is_reasoning_model=self._is_reasoning_model,
**kwargs,

Copilot AI Nov 10, 2025


The _load_flow method creates and assigns to self._prompty_file and self._flow (lines 228, 234), but immediately sets them to None in the constructor (lines 122-123). These assignments are dead code and should be removed. The method should only construct and return the flow variable (lines 235-240).

:param eval_input: The input to the evaluator. Expected to contain
whatever inputs are needed for the _flow method, including context
and other fields depending on the child class.
:type eval_input: Dict

Copilot AI Nov 10, 2025


Incorrect docstring description. The method performs groundedness evaluation, not relevance evaluation. Change "Do a relevance evaluation." to "Do a groundedness evaluation."
