amrit110 commented Nov 6, 2025

This pull request introduces improvements to schema item filtering, LLM output sanitization, and prompt clarity for schema ranking. The most important changes are the addition of robust parsing and validation utilities for schema items, enhanced pre-processing and extraction of LLM-generated JSON, and improved handling of invalid LLM outputs. The schema ranking prompt has also been rewritten for clarity and strict output requirements.

Schema item parsing and filtering improvements:

  • Added utility functions (_parse_schema_item, _parse_column_ref, _get_foreign_key) to robustly parse and validate schema item references, and to extract foreign key relationships in src/pipe/add_schema.py. This ensures only valid columns and their foreign keys are included during schema filtering.
  • Refactored the filter_schema function to use these utilities, improving the reliability of schema filtering and handling edge cases.
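
As a rough illustration of the parsing and filtering described above, the sketch below splits `table.column` references, drops malformed or unknown ones, and pulls in the column on the other side of a foreign key. It is not the code in src/pipe/add_schema.py: the schema layout (a dict with `columns` and `foreign_keys` per table) and the names `parse_column_ref_sketch` and `filter_schema_sketch` are assumptions made for the example.

```python
from typing import Optional


def parse_column_ref_sketch(ref: object) -> Optional[tuple]:
    """Split 'table.column' into (table, column); return None if malformed."""
    if not isinstance(ref, str):
        return None
    parts = [p.strip() for p in ref.split(".")]
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def filter_schema_sketch(schema: dict, items: list) -> dict:
    """Keep only valid columns from `items`, plus their foreign key targets.

    Assumed (hypothetical) schema layout:
    {table: {"columns": [...], "foreign_keys": {column: "other_table.other_column"}}}
    """
    kept: dict = {}
    for item in items:
        parsed = parse_column_ref_sketch(item)
        if parsed is None:
            continue  # malformed reference
        table, column = parsed
        if column not in schema.get(table, {}).get("columns", []):
            continue  # unknown table or column
        kept.setdefault(table, set()).add(column)

        # Include the referenced column if this column is a foreign key.
        target = schema[table].get("foreign_keys", {}).get(column)
        target_parsed = parse_column_ref_sketch(target) if target else None
        if target_parsed is not None:
            t_table, t_column = target_parsed
            if t_column in schema.get(t_table, {}).get("columns", []):
                kept.setdefault(t_table, set()).add(t_column)
    return {table: sorted(cols) for table, cols in kept.items()}
```

With a two-table schema where `orders.user_id` references `users.id`, passing `["orders.user_id", "orders.bogus", "nonsense"]` would keep `orders.user_id` and `users.id` and silently drop the other two entries.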

LLM output sanitization and extraction:

  • Introduced _preprocess_json_string in src/pipe/llm_util.py to fix common LLM formatting errors in JSON output, such as unterminated arrays, missing quotes, and empty items, before parsing.
  • Updated extract_json to pre-process LLM output before parsing, improving robustness against malformed JSON.
  • Changed logging in extract_object to debug level for failed extractions, reducing noise in production logs.
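
To make the kind of pre-processing described above concrete, here is a minimal sketch that repairs two of the listed error classes (empty items and unterminated arrays) plus trailing commas before handing the string to `json.loads`. It is an assumption-laden illustration, not the implementation in src/pipe/llm_util.py: the function name `preprocess_json_string_sketch` and the specific regexes are hypothetical, and missing-quote repair is left out.

```python
import json
import re


def preprocess_json_string_sketch(text: str) -> str:
    """Apply a few conservative repairs to an LLM-produced JSON array string."""
    s = text.strip()
    # Collapse empty items such as '["a", , "b"]' into '["a", "b"]'.
    s = re.sub(r",(\s*,)+", ",", s)
    # Drop a trailing comma before a closing bracket or at the end of the string.
    s = re.sub(r",\s*(\]|$)", r"\1", s)
    # Close an array that was opened but never terminated.
    if s.startswith("[") and not s.endswith("]"):
        s += "]"
    return s


def extract_json_sketch(text: str):
    """Parse after pre-processing; return None instead of raising on failure."""
    try:
        return json.loads(preprocess_json_string_sketch(text))
    except json.JSONDecodeError:
        return None
```

The regexes here run over the whole string, so a comma sequence inside a quoted value would also be rewritten. That trade-off seems acceptable for lists of schema identifiers but would need a real tokenizer for arbitrary JSON.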

Schema ranking and prompt handling:

  • Added _sanitize_schema_item and improved _process_output in src/pipe/rank_schema_llm.py to validate and sanitize LLM outputs, fall back to all schema items if the output is invalid, and log warnings when necessary.
  • Rewrote the schema ranking prompt in src/pipe/rank_schema_prompts/v1.py to clarify input/output formats, requirements, and examples, ensuring LLMs return strictly valid and relevant schema items.
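
The validate-then-fall-back behaviour described above could look roughly like the sketch below. The function name `process_output_sketch`, the stripping rules, and the warning text are assumptions for illustration; the actual logic in src/pipe/rank_schema_llm.py is not shown in this description.

```python
import logging

logger = logging.getLogger(__name__)


def process_output_sketch(llm_items: object, valid_items: list) -> list:
    """Keep LLM-returned items that exist in `valid_items`, preserving order.

    Falls back to every valid item (and logs a warning) when nothing usable
    survives, so downstream steps never receive an empty schema.
    """
    valid = set(valid_items)
    cleaned: list = []
    if isinstance(llm_items, list):
        for item in llm_items:
            if not isinstance(item, str):
                continue  # e.g. numbers or nested objects
            # Strip whitespace and stray quoting or backticks around the name.
            candidate = item.strip().strip("`\"'")
            if candidate in valid and candidate not in cleaned:
                cleaned.append(candidate)
    if not cleaned:
        logger.warning(
            "LLM output contained no valid schema items; using all %d items",
            len(valid_items),
        )
        return list(valid_items)
    return cleaned
```

Falling back to the full item list trades precision for recall: a bad ranking response costs some extra prompt context later rather than causing a hard failure.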

amrit110 self-assigned this Nov 6, 2025
amrit110 added the bug and enhancement labels Nov 6, 2025
amrit110 marked this pull request as draft November 13, 2025 19:09
amrit110 marked this pull request as ready for review November 17, 2025 14:03

2 participants