
Conversation

@ahibrahimm
Contributor

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings November 10, 2025 21:33
@ahibrahimm ahibrahimm requested a review from a team as a code owner November 10, 2025 21:33
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Nov 10, 2025
Copilot finished reviewing on behalf of ahibrahimm November 10, 2025 21:36

Copilot AI left a comment


Pull Request Overview

This PR refactors the GroundednessEvaluator to handle edge cases by introducing separate prompty flows for evaluations with and without query parameters. The changes aim to improve the evaluator's handling of different input scenarios.

Key Changes:

  • Introduces dual flow initialization (_flow_with_query and _flow_no_query) to support different prompty templates based on whether a query is provided
  • Adds helper methods _validate_context and _is_single_entry for improved input validation
  • Implements edge case handling for scenarios with invalid context and single-entry inputs
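
The dual-flow routing described above can be sketched as follows. This is a minimal illustrative sketch, not the PR's actual implementation: the prompty file names, the `load_flow` callable (standing in for `AsyncPrompty.load`), and the routing condition are assumptions based on the review summary.

```python
# Hypothetical sketch of dual prompty-flow selection based on whether a
# query is present in the evaluation input. File names are made up.
class GroundednessEvaluatorSketch:
    _PROMPTY_WITH_QUERY = "groundedness_with_query.prompty"
    _PROMPTY_NO_QUERY = "groundedness_without_query.prompty"

    def __init__(self, load_flow):
        # load_flow stands in for AsyncPrompty.load(source=..., model=...).
        self._flow_with_query = load_flow(self._PROMPTY_WITH_QUERY)
        self._flow_no_query = load_flow(self._PROMPTY_NO_QUERY)

    def _select_flow(self, eval_input: dict):
        # Route to the template that matches the provided inputs; an empty
        # or absent query falls back to the no-query template.
        if eval_input.get("query"):
            return self._flow_with_query
        return self._flow_no_query
```

Keeping the routing in one small helper like this makes the edge-case behavior (query present vs. absent) testable in isolation.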
Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_groundedness/_groundedness.py:364

  • Significant code duplication. The entire _do_eval_with_flow method (lines 291-364) is a near-exact copy of the parent class's _do_eval method from _base_prompty_eval.py (lines 118-191). This violates the DRY principle and creates a maintenance burden. Consider refactoring the parent class to accept an optional flow parameter, or use composition to avoid duplicating ~75 lines of code.
        and other fields depending on the child class.
        :type eval_input: Dict
        :return: The evaluation result.
        :rtype: Dict
        """
        if "query" not in eval_input and "response" not in eval_input:
            raise EvaluationException(
                message="Only text conversation inputs are supported.",
                internal_message="Only text conversation inputs are supported.",
                blame=ErrorBlame.USER_ERROR,
                category=ErrorCategory.INVALID_VALUE,
                target=ErrorTarget.CONVERSATION,
            )
        # Call the prompty flow to get the evaluation result.
        prompty_output_dict = await flow(timeout=self._LLM_CALL_TIMEOUT, **eval_input)

        score = math.nan
        if prompty_output_dict:
            llm_output = prompty_output_dict.get("llm_output", "")
            input_token_count = prompty_output_dict.get("input_token_count", 0)
            output_token_count = prompty_output_dict.get("output_token_count", 0)
            total_token_count = prompty_output_dict.get("total_token_count", 0)
            finish_reason = prompty_output_dict.get("finish_reason", "")
            model_id = prompty_output_dict.get("model_id", "")
            sample_input = prompty_output_dict.get("sample_input", "")
            sample_output = prompty_output_dict.get("sample_output", "")
            # Parse out score and reason from evaluators known to possess them.
            if self._result_key in PROMPT_BASED_REASON_EVALUATORS:
                score, reason = parse_quality_evaluator_reason_score(llm_output)
                binary_result = self._get_binary_result(score)
                return {
                    self._result_key: float(score),
                    f"gpt_{self._result_key}": float(score),
                    f"{self._result_key}_reason": reason,
                    f"{self._result_key}_result": binary_result,
                    f"{self._result_key}_threshold": self._threshold,
                    f"{self._result_key}_prompt_tokens": input_token_count,
                    f"{self._result_key}_completion_tokens": output_token_count,
                    f"{self._result_key}_total_tokens": total_token_count,
                    f"{self._result_key}_finish_reason": finish_reason,
                    f"{self._result_key}_model": model_id,
                    f"{self._result_key}_sample_input": sample_input,
                    f"{self._result_key}_sample_output": sample_output,
                }
match = re.search(r"\d", llm_output)
if match:
    score = float(match.group())
# Bind binary_result unconditionally so the return below cannot
# reference an unbound local when no digit is found.
binary_result = self._get_binary_result(score)
            return {
                self._result_key: float(score),
                f"gpt_{self._result_key}": float(score),
                f"{self._result_key}_result": binary_result,
                f"{self._result_key}_threshold": self._threshold,
                f"{self._result_key}_prompt_tokens": input_token_count,
                f"{self._result_key}_completion_tokens": output_token_count,
                f"{self._result_key}_total_tokens": total_token_count,
                f"{self._result_key}_finish_reason": finish_reason,
                f"{self._result_key}_model": model_id,
                f"{self._result_key}_sample_input": sample_input,
                f"{self._result_key}_sample_output": sample_output,
            }

        binary_result = self._get_binary_result(score)
        return {
            self._result_key: float(score),
            f"gpt_{self._result_key}": float(score),
            f"{self._result_key}_result": binary_result,
            f"{self._result_key}_threshold": self._threshold,
        }

    async def _real_call(self, **kwargs):
        """The asynchronous call where real end-to-end evaluation logic is performed.

        :keyword kwargs: The inputs to evaluate.

whatever inputs are needed for the _flow method, including context
and other fields depending on the child class.
:type eval_input: Dict
:return: The evaluation result.

Copilot AI Nov 10, 2025


Incomplete comment. The NOTE on line 293 states "This is copy from parent" but doesn't explain why the copy is necessary or reference a tracking issue for refactoring. Consider expanding this comment to explain the rationale and possibly link to a future work item for eliminating the duplication.
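
The reviewer's suggested alternative to the copied method is to let the parent accept an optional flow override. A minimal sketch of that refactor, assuming simplified class and method shapes (`_do_eval`, `_flow` follow the review; the bodies are illustrative, not the SDK's real code):

```python
import math

# Hypothetical base class: _do_eval takes an optional flow override so
# subclasses can swap templates without duplicating the method body.
class PromptyEvaluatorBaseSketch:
    def __init__(self, flow):
        self._flow = flow

    async def _do_eval(self, eval_input: dict, flow=None) -> dict:
        # Fall back to the default flow when no override is supplied.
        flow = flow or self._flow
        output = await flow(**eval_input)
        return {"score": output.get("score", math.nan)}

class GroundednessSketch(PromptyEvaluatorBaseSketch):
    def __init__(self, flow_with_query, flow_no_query):
        super().__init__(flow_with_query)
        self._flow_no_query = flow_no_query

    async def _do_eval(self, eval_input: dict, flow=None) -> dict:
        # Choose the flow once, then reuse the parent implementation
        # instead of copying its ~75 lines.
        chosen = self._flow if "query" in eval_input else self._flow_no_query
        return await super()._do_eval(eval_input, flow=chosen)
```

With this shape, the parent keeps sole ownership of the scoring and token-accounting logic, and the subclass only decides which flow to run.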

Comment on lines 219 to +234
UserAgentSingleton().value,
)
self._flow = AsyncPrompty.load(source=self._prompty_file, model=prompty_model_config)
flow = AsyncPrompty.load(
source=prompty_path,
model=prompty_model_config,
is_reasoning_model=self._is_reasoning_model,
**kwargs,

Copilot AI Nov 10, 2025


The _load_flow method creates and assigns to self._prompty_file and self._flow (lines 228, 234), but immediately sets them to None in the constructor (lines 122-123). These assignments are dead code and should be removed. The method should only construct and return the flow variable (lines 235-240).

:param eval_input: The input to the evaluator. Expected to contain
whatever inputs are needed for the _flow method, including context
and other fields depending on the child class.
:type eval_input: Dict

Copilot AI Nov 10, 2025


Incorrect docstring description. The method performs groundedness evaluation, not relevance evaluation. Change "Do a relevance evaluation." to "Do a groundedness evaluation."
