Fixes to LLM output parsing when using LLM-based ranking #16
+259
−47
This pull request introduces improvements to schema item filtering, LLM output sanitization, and prompt clarity for schema ranking. The most important changes are the addition of robust parsing and validation utilities for schema items, enhanced pre-processing and extraction of LLM-generated JSON, and improved handling of invalid LLM outputs. The schema ranking prompt has also been rewritten for clarity and strict output requirements.
Schema item parsing and filtering improvements:
- Added utility functions (`_parse_schema_item`, `_parse_column_ref`, `_get_foreign_key`) in `src/pipe/add_schema.py` to robustly parse and validate schema item references and to extract their foreign key relationships, ensuring only valid columns and their foreign keys are included during schema filtering.
- Updated the `filter_schema` function to use these utilities, improving the reliability of schema filtering and the handling of edge cases (see the sketch after this list).
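The diff itself is not reproduced here, so the following is only a minimal sketch of how such parsing and foreign-key extraction could fit together. The function names (`_parse_schema_item`, `_parse_column_ref`, `_get_foreign_key`, `filter_schema`) come from the summary above, but the signatures, the `table.column` reference format, and the `{table: {column: foreign_key_or_None}}` schema shape are illustrative assumptions, not the actual implementation in `src/pipe/add_schema.py`:

```python
from typing import Optional, Tuple

# Assumed schema shape, for illustration only:
# {table_name: {column_name: "other_table.other_column" or None}}
Schema = dict[str, dict[str, Optional[str]]]


def _parse_column_ref(item: str) -> Optional[Tuple[str, str]]:
    """Split a 'table.column' reference; return None if it is malformed."""
    parts = item.strip().strip('"').split(".")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0].strip(), parts[1].strip()


def _parse_schema_item(item: str, schema: Schema) -> Optional[Tuple[str, str]]:
    """Return (table, column) only if the reference exists in the schema."""
    ref = _parse_column_ref(item)
    if ref is None:
        return None
    table, column = ref
    if table in schema and column in schema[table]:
        return table, column
    return None


def _get_foreign_key(table: str, column: str, schema: Schema) -> Optional[str]:
    """Return the foreign-key target of a column, if it has one."""
    return schema.get(table, {}).get(column)


def filter_schema(items: list[str], schema: Schema) -> set[Tuple[str, str]]:
    """Keep only valid columns, pulling in their foreign-key targets as well."""
    kept: set[Tuple[str, str]] = set()
    for item in items:
        parsed = _parse_schema_item(item, schema)
        if parsed is None:
            continue  # drop malformed or unknown references
        kept.add(parsed)
        fk = _get_foreign_key(*parsed, schema)
        if fk is not None and (fk_ref := _parse_column_ref(fk)) is not None:
            kept.add(fk_ref)
    return kept
```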
LLM output sanitization and extraction:
- Added `_preprocess_json_string` in `src/pipe/llm_util.py` to fix common LLM formatting errors in JSON output, such as broken array termination, missing quotes, and empty items, before parsing (a minimal sketch follows this list).
- Updated `extract_json` to pre-process LLM output before parsing, improving robustness against malformed JSON.
- Changed logging in `extract_object` to debug level for failed extractions, reducing noise in production logs.
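Again only a sketch: `_preprocess_json_string` and `extract_json` are the names used in `src/pipe/llm_util.py`, but the specific textual repairs and signatures below are assumptions, and the missing-quote repair mentioned above is omitted for brevity:

```python
import json
import logging
import re
from typing import Any, Optional

logger = logging.getLogger(__name__)


def _preprocess_json_string(text: str) -> str:
    """Apply cheap textual repairs for common LLM JSON mistakes (illustrative set)."""
    cleaned = text.strip()
    # Strip markdown code fences the model may have wrapped around the JSON.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
    # Drop empty items and trailing commas, e.g. ["a", , "b"] or ["a", "b",].
    cleaned = re.sub(r",\s*(?=[,\]\}])", "", cleaned)
    # Terminate an array that was cut off mid-generation.
    if cleaned.startswith("[") and not cleaned.rstrip().endswith("]"):
        cleaned = cleaned.rstrip().rstrip(",") + "]"
    # (Missing-quote repair, also mentioned in the PR description, is omitted here.)
    return cleaned


def extract_json(raw_output: str) -> Optional[Any]:
    """Pre-process the raw LLM output, then parse it; return None on failure."""
    try:
        return json.loads(_preprocess_json_string(raw_output))
    except json.JSONDecodeError as exc:
        # Debug level keeps routinely malformed outputs out of production logs.
        logger.debug("Failed to parse JSON from LLM output: %s", exc)
        return None
```

Under these assumptions, output that was truncated mid-list, such as `["students.name", "students.age",`, still parses to a usable Python list instead of raising.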
Schema ranking and prompt handling:
- Added `_sanitize_schema_item` and improved `_process_output` in `src/pipe/rank_schema_llm.py` to validate and sanitize LLM outputs, fall back to all schema items if the output is invalid, and log warnings when necessary (sketched below).
- Rewrote the schema ranking prompt in `src/pipe/rank_schema_prompts/v1.py` to clarify input/output formats, requirements, and examples, ensuring LLMs return strictly valid and relevant schema items.
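As above, `_sanitize_schema_item` and `_process_output` are the names from `src/pipe/rank_schema_llm.py`, while the signatures, validation rules, and exact fallback and logging behaviour shown here are assumptions sketching the described approach (validate each item, fall back to all schema items when the output is unusable, and warn when that happens):

```python
import logging
from typing import Iterable, List, Optional

logger = logging.getLogger(__name__)


def _sanitize_schema_item(item: object, valid_items: set[str]) -> Optional[str]:
    """Normalise one LLM-returned item and keep it only if the schema knows it."""
    if not isinstance(item, str):
        return None
    cleaned = item.strip().strip('"').strip("`")
    return cleaned if cleaned in valid_items else None


def _process_output(llm_output: object, all_items: Iterable[str]) -> List[str]:
    """Validate ranked items; fall back to all schema items if the output is unusable."""
    valid_items = set(all_items)
    if not isinstance(llm_output, list):
        logger.warning("LLM ranking output is not a list; falling back to all schema items")
        return sorted(valid_items)
    ranked = [
        s for item in llm_output
        if (s := _sanitize_schema_item(item, valid_items)) is not None
    ]
    if not ranked:
        logger.warning("LLM ranking output had no valid schema items; falling back to all schema items")
        return sorted(valid_items)
    return ranked
```

The point of the fallback is that a bad ranking degrades gracefully: downstream steps still see the full schema rather than an empty or corrupted item list.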