Fixes to LLM output parsing when using LLM-based ranking #16
+259
−47
This pull request introduces improvements to schema item filtering, LLM output sanitization, and prompt clarity for schema ranking. The most important changes are the addition of robust parsing and validation utilities for schema items, enhanced pre-processing and extraction of LLM-generated JSON, and improved handling of invalid LLM outputs. The schema ranking prompt has also been rewritten for clarity and strict output requirements.
Schema item parsing and filtering improvements:
- Added utility functions (`_parse_schema_item`, `_parse_column_ref`, `_get_foreign_key`) in `src/pipe/add_schema.py` to robustly parse and validate schema item references and to extract their foreign key relationships, ensuring only valid columns and their foreign keys are included during schema filtering.
- Updated the `filter_schema` function to use these utilities, improving the reliability of schema filtering and the handling of edge cases (see the sketch after this list).
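The diff itself is not reproduced here, so the following is only a minimal sketch of how such parsing and foreign-key extraction could fit together. The function names (`_parse_schema_item`, `_parse_column_ref`, `_get_foreign_key`, `filter_schema`) come from the summary above, but the signatures, the `table.column` reference format, and the `{table: {column: foreign_key_or_None}}` schema shape are illustrative assumptions, not the actual implementation in `src/pipe/add_schema.py`:

```python
from typing import Optional, Tuple

# Assumed schema shape, for illustration only:
# {table_name: {column_name: "other_table.other_column" or None}}
Schema = dict[str, dict[str, Optional[str]]]


def _parse_column_ref(item: str) -> Optional[Tuple[str, str]]:
    """Split a 'table.column' reference; return None if it is malformed."""
    parts = item.strip().strip('"').split(".")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0].strip(), parts[1].strip()


def _parse_schema_item(item: str, schema: Schema) -> Optional[Tuple[str, str]]:
    """Return (table, column) only if the reference exists in the schema."""
    ref = _parse_column_ref(item)
    if ref is None:
        return None
    table, column = ref
    if table in schema and column in schema[table]:
        return table, column
    return None


def _get_foreign_key(table: str, column: str, schema: Schema) -> Optional[str]:
    """Return the foreign-key target of a column, if it has one."""
    return schema.get(table, {}).get(column)


def filter_schema(items: list[str], schema: Schema) -> set[Tuple[str, str]]:
    """Keep only valid columns, pulling in their foreign-key targets as well."""
    kept: set[Tuple[str, str]] = set()
    for item in items:
        parsed = _parse_schema_item(item, schema)
        if parsed is None:
            continue  # drop malformed or unknown references
        kept.add(parsed)
        fk = _get_foreign_key(*parsed, schema)
        if fk is not None and (fk_ref := _parse_column_ref(fk)) is not None:
            kept.add(fk_ref)
    return kept
```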
LLM output sanitization and extraction:
- Added `_preprocess_json_string` in `src/pipe/llm_util.py` to fix common LLM formatting errors in JSON output, such as broken array termination, missing quotes, and empty items, before parsing (a minimal sketch follows this list).
- Updated `extract_json` to pre-process LLM output before parsing, improving robustness against malformed JSON.
- Changed logging in `extract_object` to debug level for failed extractions, reducing noise in production logs.
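Again only a sketch: `_preprocess_json_string` and `extract_json` are the names used in `src/pipe/llm_util.py`, but the specific textual repairs and signatures below are assumptions, and the missing-quote repair mentioned above is omitted for brevity:

```python
import json
import logging
import re
from typing import Any, Optional

logger = logging.getLogger(__name__)


def _preprocess_json_string(text: str) -> str:
    """Apply cheap textual repairs for common LLM JSON mistakes (illustrative set)."""
    cleaned = text.strip()
    # Strip markdown code fences the model may have wrapped around the JSON.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
    # Drop empty items and trailing commas, e.g. ["a", , "b"] or ["a", "b",].
    cleaned = re.sub(r",\s*(?=[,\]\}])", "", cleaned)
    # Terminate an array that was cut off mid-generation.
    if cleaned.startswith("[") and not cleaned.rstrip().endswith("]"):
        cleaned = cleaned.rstrip().rstrip(",") + "]"
    # (Missing-quote repair, also mentioned in the PR description, is omitted here.)
    return cleaned


def extract_json(raw_output: str) -> Optional[Any]:
    """Pre-process the raw LLM output, then parse it; return None on failure."""
    try:
        return json.loads(_preprocess_json_string(raw_output))
    except json.JSONDecodeError as exc:
        # Debug level keeps routinely malformed outputs out of production logs.
        logger.debug("Failed to parse JSON from LLM output: %s", exc)
        return None
```

Under these assumptions, output that was truncated mid-list, such as `["students.name", "students.age",`, still parses to a usable Python list instead of raising.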
Schema ranking and prompt handling:
- Added `_sanitize_schema_item` and improved `_process_output` in `src/pipe/rank_schema_llm.py` to validate and sanitize LLM outputs, fall back to all schema items if the output is invalid, and log warnings when necessary (sketched below).
- Rewrote the schema ranking prompt in `src/pipe/rank_schema_prompts/v1.py` to clarify input/output formats, requirements, and examples, ensuring LLMs return strictly valid and relevant schema items.
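As above, `_sanitize_schema_item` and `_process_output` are the names from `src/pipe/rank_schema_llm.py`, while the signatures, validation rules, and exact fallback and logging behaviour shown here are assumptions sketching the described approach (validate each item, fall back to all schema items when the output is unusable, and warn when that happens):

```python
import logging
from typing import Iterable, List, Optional

logger = logging.getLogger(__name__)


def _sanitize_schema_item(item: object, valid_items: set[str]) -> Optional[str]:
    """Normalise one LLM-returned item and keep it only if the schema knows it."""
    if not isinstance(item, str):
        return None
    cleaned = item.strip().strip('"').strip("`")
    return cleaned if cleaned in valid_items else None


def _process_output(llm_output: object, all_items: Iterable[str]) -> List[str]:
    """Validate ranked items; fall back to all schema items if the output is unusable."""
    valid_items = set(all_items)
    if not isinstance(llm_output, list):
        logger.warning("LLM ranking output is not a list; falling back to all schema items")
        return sorted(valid_items)
    ranked = [
        s for item in llm_output
        if (s := _sanitize_schema_item(item, valid_items)) is not None
    ]
    if not ranked:
        logger.warning("LLM ranking output had no valid schema items; falling back to all schema items")
        return sorted(valid_items)
    return ranked
```

The point of the fallback is that a bad ranking degrades gracefully: downstream steps still see the full schema rather than an empty or corrupted item list.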