qwen3_workarena_changes #312
base: main
```diff
@@ -412,6 +412,7 @@ def __init__(self, obs, flags: ObsFlags) -> None:
             visible=lambda: flags.use_html,
             prefix="## ",
         )
+        obs["axtree_txt"] = remove_ui_patterns(obs["axtree_txt"])
         self.ax_tree = AXTree(
             obs["axtree_txt"],
             visible_elements_only=flags.filter_visible_elements_only,
@@ -874,3 +875,40 @@ def obs_mapping(obs: dict):
         return obs

     return obs_mapping
+
+
+import re
+
+def remove_ui_patterns(text):
+    """
+    Remove lines containing specific UI patterns from ServiceNow accessibility tree text.
+
+    Args:
+        text (str): The input string containing the accessibility tree
+
+    Returns:
+        str: The cleaned string with lines containing UI patterns removed
+    """
```
Comment on lines +884 to +893
**Missing Purpose in UI Pattern Removal Docstring**

What is the issue?
The docstring lacks an explanation of why these UI patterns need to be removed and what impact the removal has.

Why this matters
Without understanding the purpose, maintainers may inadvertently modify or remove this filtering, which could impact the accessibility tree's usability.

Suggested change ∙ Feature Preview: `def remove_ui_patterns(text):`
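Korbit's suggested docstring is truncated above; purely as an illustration (hypothetical wording, not the reviewer's text), a purpose-bearing docstring might read:

```python
# Hypothetical docstring wording, not the reviewer's truncated suggestion.
def remove_ui_patterns(text):
    """Strip ServiceNow widget chrome from the accessibility tree text.

    ServiceNow dashboards embed widget-editing controls ("Edit Widget",
    "Close Widget", "Add content", ...) on every page; they are noise for
    the agent and inflate the prompt, so lines containing them are dropped
    before the AXTree is built.
    """
```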
```diff
+
+    # Words to look for
+    words_to_remove = ["Edit Widget", "Edit Widget Preferences", "Close Widget", "Add content"]
```
**Hardcoded UI Pattern List**

What is the issue?
UI patterns to remove are hardcoded as a list within the function, making it difficult to modify or reuse these patterns elsewhere.

Why this matters
If UI patterns need to be updated or reused in other parts of the code, developers would need to modify them in multiple places, increasing maintenance burden and the risk of inconsistency.

Suggested change ∙ Feature Preview: move the UI patterns to a module-level constant or configuration:

```python
UI_PATTERNS_TO_REMOVE = [
    "Edit Widget",
    "Edit Widget Preferences",
    "Close Widget",
    "Add content",
]

def remove_ui_patterns(text, patterns=UI_PATTERNS_TO_REMOVE):
    ...
```
```diff
+
+    # Split text into lines
+    lines = text.split('\n')
+
+    # Keep lines that don't contain any of the words
+    filtered_lines = []
+    for line in lines:
+        should_keep = True
+        for word in words_to_remove:
+            if word in line:
+                should_keep = False
+                break
+        if should_keep:
+            filtered_lines.append(line)
+
+    # Join the remaining lines back together
+    return '\n'.join(filtered_lines)
```
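As an aside, the keep/drop loop in the new function can be collapsed into a single `any()` filter; a minimal sketch of a behavior-preserving refactor (hypothetical, not part of the PR):

```python
def remove_ui_patterns(text: str) -> str:
    """Drop lines that contain any known ServiceNow UI pattern."""
    patterns = ("Edit Widget", "Edit Widget Preferences", "Close Widget", "Add content")
    # Keep a line only if none of the patterns occur in it.
    return "\n".join(line for line in text.split("\n")
                     if not any(p in line for p in patterns))
```

Note that under substring matching, "Edit Widget Preferences" is redundant: any line containing it already contains "Edit Widget".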
```diff
@@ -149,6 +149,48 @@
     add_missparsed_messages=True,
 )

+# qwen3 default config - same as llama3-70b but with use_think_history=False
+FLAGS_QWEN3 = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=False,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=False,  # Qwen3 doesn't include thinking history
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=False,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        action_set=HighLevelActionSetArgs(
+            subsets=["bid"],
+            multiaction=False,
+        ),
+        long_description=False,
+        individual_examples=True,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+    add_missparsed_messages=True,
+)
```
Comment on lines +153 to +192
**Missing AGENT_QWEN3 configuration**

What is the issue?
The FLAGS_QWEN3 configuration is defined, but no corresponding AGENT_QWEN3 is created to use it.

Why this matters
Without a corresponding agent configuration, FLAGS_QWEN3 cannot be used by the system, making the configuration effectively dead code that serves no functional purpose.

Suggested change ∙ Feature Preview: add a corresponding agent configuration after the FLAGS_QWEN3 definition:

```python
AGENT_QWEN3 = GenericAgentArgs(
    chat_model_args=CHAT_MODEL_ARGS_DICT["qwen3_model_key"],  # Replace with actual Qwen3 model key
    flags=FLAGS_QWEN3,
)
```
Comment on lines +152 to +192
**Configuration Duplication Instead of Inheritance**

What is the issue?
The code duplicates the entire configuration structure from FLAGS_LLAMA3_70B with only one parameter difference (use_think_history=False).

Why this matters
Code duplication increases maintenance burden and the risk of inconsistencies when updates are needed. If the base configuration changes, both copies need to be updated.

Suggested change ∙ Feature Preview:

```python
# Create Qwen3 flags by inheriting from LLAMA3_70B flags
FLAGS_QWEN3 = FLAGS_LLAMA3_70B.copy()
FLAGS_QWEN3.obs.use_think_history = False  # Override only the different parameter
```
```diff
+
 AGENT_LLAMA3_70B = GenericAgentArgs(
     chat_model_args=CHAT_MODEL_ARGS_DICT["openrouter/meta-llama/llama-3-70b-instruct"],
     flags=FLAGS_LLAMA3_70B,
```
```diff
@@ -324,7 +324,7 @@ def __call__(self, messages: list[dict], n_samples: int = 1, temperature: float
             tracking.TRACKER.instance(input_tokens, output_tokens, cost)

         if n_samples == 1:
-            res = AIMessage(completion.choices[0].message.content)
+            res = AIMessage(_extract_thinking_content_from_response(completion))
             if self.log_probs:
                 res["log_probs"] = completion.choices[0].log_probs
             return res
@@ -593,3 +593,72 @@ def make_model(self):
             temperature=self.temperature,
             max_tokens=self.max_new_tokens,
         )
+
+
+import logging
+
+def _extract_thinking_content_from_response(response, wrap_tag: str = "think"):
```
**Complex Function with Multiple Responsibilities**

What is the issue?
The function is too complex and handles multiple responsibilities: message normalization, reasoning extraction, content parsing with multiple formats, and content wrapping.

Why this matters
The high complexity makes the code harder to maintain, test, and modify. Each responsibility could change for different reasons, violating the Single Responsibility Principle.

Suggested change ∙ Feature Preview: split the function into smaller, focused functions:

```python
def _normalize_message(response):
    message = response.choices[0].message
    return message.to_dict() if not isinstance(message, dict) else message

def _extract_reasoning(message):
    return message.get("reasoning") or message.get("reasoning_content")

def _parse_content(raw_content):
    # Handle string vs list content parsing logic
    ...

def _format_response(reasoning_text, msg_text, msg_content, wrap_tag):
    # Handle wrapping and combining the parts
    ...
```
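For completeness, a sketch of how the split helpers might recombine into the original entry point (hypothetical wiring, reusing the names from the suggestion above):

```python
def _extract_thinking_content_from_response(response, wrap_tag: str = "think"):
    # Compose the focused helpers suggested above.
    message = _normalize_message(response)
    reasoning_text = _extract_reasoning(message)
    msg_text = message.get("text", "")  # OpenRouter-style surface text
    msg_content = _parse_content(message.get("content", ""))
    return _format_response(reasoning_text, msg_text, msg_content, wrap_tag)
```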
| """ | ||
| Extracts the content from the message, including reasoning if available. | ||
| It wraps the reasoning around <think>...</think> for easy identification of reasoning content, | ||
| when LLM produces 'text' and 'reasoning' in the same message. | ||
| 
Comment on lines +602 to +606
**Missing response parameter structure documentation**

What is the issue?
The docstring lacks information about the expected format/structure of the 'response' parameter.

Why this matters
Without clear documentation of the expected response structure, developers may pass incorrectly formatted responses, leading to runtime errors.

Suggested change ∙ Feature Preview: `def _extract_thinking_content_from_response(response, wrap_tag: str = "think"):`
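Since the reviewer's snippet is truncated, here is an illustrative docstring derived from what the function body actually reads (wording is hypothetical):

```python
def _extract_thinking_content_from_response(response, wrap_tag: str = "think"):
    """Extract message content, prefixing any provider-reported reasoning.

    Args:
        response: An OpenAI-style chat completion; only response.choices[0].message
            is read. It may be an object exposing .to_dict() or a plain dict, with
            optional keys 'reasoning' / 'reasoning_content', 'text', and 'content'
            (a string or a list of content blocks).
        wrap_tag: The tag name to wrap reasoning content (default: "think").
    """
```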
```diff
+
+    Args:
+        response: The message object or dict containing content and reasoning.
+        wrap_tag: The tag name to wrap reasoning content (default: "think").
+
+    Returns:
+        str: The extracted content with reasoning wrapped in specified tags.
+    """
+    # Normalize to dict
+    message = response.choices[0].message
+    if not isinstance(message, dict):
+        message = message.to_dict()
+
+    # --- Extract reasoning from either `reasoning` or `reasoning_content` (and optional metadata) ---
+    reasoning_text = (
+        message.get("reasoning")
+        or message.get("reasoning_content")
+    )
+
+    # --- Extract surface text/content (keeps your original behavior, but handles list-style `content`) ---
+    msg_text = message.get("text", "")  # works for OpenRouter
+    raw_content = message.get("content", "")
+
+    if isinstance(raw_content, list):
+        # Concatenate text-like parts if provider returns content blocks
+        parts = []
+        for part in raw_content:
+            if isinstance(part, str):
+                parts.append(part)
+            elif isinstance(part, dict):
+                # Common shapes: {"type":"text","text":"..."} or {"type":"text","text":{"value":"..."}}
+                if part.get("type") == "text":
+                    txt = part.get("text")
+                    if isinstance(txt, dict) and isinstance(txt.get("value"), str):
+                        parts.append(txt["value"])
+                    elif isinstance(txt, str):
+                        parts.append(txt)
+                else:
+                    # Fallback: try a few likely keys
+                    for k in ("content", "text", "value"):
+                        v = part.get(k)
+                        if isinstance(v, str):
+                            parts.append(v)
+                            break
+            else:
+                parts.append(str(part))
+        msg_content = "\n".join(p for p in parts if p)
```
      Comment on lines
    
      +630
     to 
      +653
    
**Inefficient nested iteration in content processing**

What is the issue?
Inefficient nested loops and repeated dictionary lookups when processing list-style content.

Why this matters
The nested loops with multiple dictionary get() calls and string type checks create O(n*m) complexity, where n is the number of parts and m is the number of fallback keys. This will cause performance degradation when processing large content lists.

Suggested change ∙ Feature Preview: extract the text extraction logic into a separate function and use early returns to avoid nested loops:

```python
def _extract_text_from_part(part):
    if isinstance(part, str):
        return part
    if isinstance(part, dict):
        if part.get("type") == "text":
            txt = part.get("text")
            if isinstance(txt, dict):
                return txt.get("value", "")
            return txt or ""
        # Single loop for fallback keys
        for key in ("content", "text", "value"):
            value = part.get(key)
            if isinstance(value, str):
                return value
    return str(part)

# Then use: parts = [_extract_text_from_part(part) for part in raw_content]
```
```diff
+    else:
+        msg_content = raw_content if isinstance(raw_content, str) else str(raw_content or "")
+
+    # --- Wrap reasoning if present ---
+    if reasoning_text:
+        reasoning_wrapped = f"<{wrap_tag}>{reasoning_text}</{wrap_tag}>\n" if wrap_tag else (reasoning_text + "\n")
+        logging.debug("Extracting content from response.choices[i].message.(reasoning|reasoning_content)")
+    else:
+        reasoning_wrapped = ""
+
+    return f"{reasoning_wrapped}{msg_text}{msg_content}"
```
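To make the intended behavior concrete, a tiny usage illustration (hypothetical stand-in object; values invented):

```python
from types import SimpleNamespace

# Minimal stand-in mimicking response.choices[0].message as a plain dict.
fake = SimpleNamespace(choices=[SimpleNamespace(message={
    "reasoning": "Let me check the page.",
    "content": "click('a12')",
})])

print(_extract_thinking_content_from_response(fake))
# <think>Let me check the page.</think>
# click('a12')
```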
**In-place mutation of observation dictionary**

What is the issue?
The code modifies the observation dictionary in place during object initialization, which could cause side effects if the same observation dictionary is used elsewhere or reused.

Why this matters
This mutation could lead to unexpected behavior in other parts of the system that expect the original unmodified observation data, potentially causing debugging difficulties and inconsistent state.

Suggested change ∙ Feature Preview: create a copy of the observation or filter the axtree_txt after extraction, or apply the filtering inside the AXTree constructor.
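A minimal sketch of the non-mutating variant (names taken from the diff above; the exact `AXTree` signature is assumed from the hunk):

```python
# Inside __init__: filter a local copy of the value instead of rewriting
# obs["axtree_txt"] in place, so callers still see the original observation.
cleaned_axtree_txt = remove_ui_patterns(obs["axtree_txt"])
self.ax_tree = AXTree(
    cleaned_axtree_txt,  # obs["axtree_txt"] still holds the unfiltered tree
    visible_elements_only=flags.filter_visible_elements_only,
)
```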