fix: glob mapping #430

danielfrey63 · 2025-12-30T22:34:22Z

Summary

This PR makes the backend repository ingestion (api/data_pipeline.py → read_all_documents) correctly interpret included_files / excluded_files as glob patterns (e.g. *.md, packages/*/dist) in a predictable, cross-platform way.

Previously, patterns in file filter configuration could behave inconsistently because matching was effectively done as plain filename equality in some cases. That led to unexpected files being embedded/indexed, increasing processing time and cost, and reducing retrieval quality.

Problem / Motivation

api/config/repo.json contains many ignore patterns that are clearly intended to be globs (e.g. *.min.js, packages/*/dist, etc.). Without consistent glob matching:

- Files that should be excluded (e.g. Markdown docs via *.md) may still be processed.
- Inclusion-only mode via included_files does not behave as users expect.
- Filtering can differ depending on OS path separators (\ vs /) and on whether the code checks filenames or relative paths.

What Changed

1) Glob-aware matching in api/data_pipeline.py

Inside read_all_documents, the helper should_process_file(...) now:

- Normalizes paths using os.path.normpath
- Computes:
  - file_name (basename)
  - rel_path_norm (normalized relative path)
- Introduces glob-based matching via fnmatch.fnmatchcase(...) against:
  - the filename (README.md)
  - the normalized relative path (docs/README.md, packages/foo/dist/index.js)

This makes included_files / excluded_files behave like proper glob filters rather than requiring exact filename matches.
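
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function signatures, the normalize_rel_path helper, and the rule that included_files takes precedence over excluded_files are all assumptions made for the example. Only the names should_process_file, matches_any_glob, file_name, and rel_path_norm come from this conversation.

```python
import fnmatch
import os


def normalize_rel_path(path: str) -> str:
    """Hypothetical helper: collapse redundant segments, always use '/'."""
    # Convert backslashes first so Windows-style input behaves the same on POSIX.
    return os.path.normpath(path.replace("\\", "/")).replace(os.sep, "/")


def matches_any_glob(rel_path: str, patterns) -> bool:
    """Return True if any pattern matches the basename or the relative path."""
    rel_path_norm = normalize_rel_path(rel_path)
    file_name = rel_path_norm.rsplit("/", 1)[-1]
    return any(
        fnmatch.fnmatchcase(file_name, pattern)
        or fnmatch.fnmatchcase(rel_path_norm, pattern)
        for pattern in patterns
    )


def should_process_file(rel_path, included_files=(), excluded_files=()) -> bool:
    """Assumed semantics: inclusion-only mode when included_files is non-empty."""
    if included_files:
        return matches_any_glob(rel_path, included_files)
    return not matches_any_glob(rel_path, excluded_files)
```

In this sketch, should_process_file("docs\\README.md", included_files=["*.md"]) is accepted on any OS because the separator is rewritten before matching.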

2) New unit tests

Added tests/unit/test_file_filters_glob.py to verify the intended behavior:

- Exclusion: excluded_files=["*.md"] excludes README.md while keeping other files.
- Inclusion: included_files=["*.md"] includes only Markdown files.

Tests run in a temporary directory to stay deterministic and independent of repo contents.
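
A test in this style might look like the sketch below. It does not reproduce tests/unit/test_file_filters_glob.py: list_processed_files is a hypothetical stand-in for read_all_documents so that the example runs on its own, and the sample file names are invented for illustration.

```python
import fnmatch
import os
import tempfile


def _hit(name, rel, patterns):
    """True if any pattern matches the basename or the relative path."""
    return any(
        fnmatch.fnmatchcase(name, p) or fnmatch.fnmatchcase(rel, p)
        for p in patterns
    )


def list_processed_files(root, included_files=(), excluded_files=()):
    """Hypothetical stand-in for read_all_documents: returns kept relative paths."""
    kept = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if included_files:
                keep = _hit(name, rel, included_files)
            else:
                keep = not _hit(name, rel, excluded_files)
            if keep:
                kept.append(rel)
    return sorted(kept)


def make_sample_repo(root):
    """Create a tiny, fixed file tree so results do not depend on the real repo."""
    for rel in ("README.md", "docs/guide.md", "src/main.py"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("sample\n")


def test_exclude_markdown():
    with tempfile.TemporaryDirectory() as root:
        make_sample_repo(root)
        assert list_processed_files(root, excluded_files=["*.md"]) == ["src/main.py"]


def test_include_only_markdown():
    with tempfile.TemporaryDirectory() as root:
        make_sample_repo(root)
        kept = list_processed_files(root, included_files=["*.md"])
        assert kept == ["README.md", "docs/guide.md"]
```

Building the tree inside tempfile.TemporaryDirectory keeps each test isolated and cleans up automatically when the block exits.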

Why This Approach

- No new dependencies: Uses Python stdlib fnmatch.
- Simple + readable: Keeps logic local to read_all_documents and avoids over-engineering.
- Cross-platform consistency: Path normalization prevents Windows separator issues.

Behavior / Compatibility Notes

- No API changes.
- Existing configuration files benefit immediately.
- Filtering is now consistent for patterns matching either:
  - filenames, or
  - relative paths (recommended for nested patterns like packages/*/dist).
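
For readers choosing between filename and path patterns, the short script below shows how the standard library's fnmatch.fnmatchcase treats a few candidate strings. The paths are illustrative. Two stdlib behaviors are worth knowing: the "/" separator is not special to fnmatch, so "*" can span directories, and a pattern must match the whole string. Whether the PR adds extra handling for files beneath a matched directory is not shown in this conversation, so the last two cases describe plain fnmatchcase only.

```python
from fnmatch import fnmatchcase

# (candidate string, pattern) pairs; the candidates are illustrative paths.
CASES = [
    ("README.md", "*.md"),                                # True: basename match
    ("docs/README.md", "*.md"),                           # True: "*" also spans "/"
    ("README.MD", "*.md"),                                # False: case-sensitive
    ("packages/foo/dist", "packages/*/dist"),             # True: whole path matches
    ("packages/foo/dist/index.js", "packages/*/dist"),    # False: trailing part unmatched
    ("packages/foo/dist/index.js", "packages/*/dist/*"),  # True: trailing "/*" added
]

# Map each case to the result of a plain fnmatchcase call.
results = {case: fnmatchcase(*case) for case in CASES}

for (candidate, pattern), matched in results.items():
    print(f"{pattern!r:22} vs {candidate!r:30} -> {matched}")
```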

Testing

Added unit tests: tests/unit/test_file_filters_glob.py

Suggested Manual Smoke Test (Optional)

Run ingestion on a sample repo with:
- excluded_files=["*.md"] and confirm Markdown files are skipped
- included_files=["*.md"] and confirm only Markdown files are processed

gemini-code-assist · 2025-12-30T22:34:37Z

Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the file filtering mechanism within the data ingestion pipeline to accurately process glob patterns. By integrating robust glob matching, it resolves previous issues where file inclusion and exclusion rules were inconsistently applied, leading to more precise data processing, reduced overhead, and improved retrieval quality.

Highlights

- Consistent Glob Pattern Matching: The backend repository ingestion now correctly interprets "included_files" and "excluded_files" as glob patterns (e.g., "*.md", "packages/*/dist") using "fnmatch.fnmatchcase" for predictable, cross-platform behavior.
- Improved File Filtering Logic: The "should_process_file" helper function in "api/data_pipeline.py" has been updated to normalize paths and apply glob matching against both the filename and the normalized relative path, resolving previous inconsistencies.
- New Unit Tests: Dedicated unit tests have been added in "tests/unit/test_file_filters_glob.py" to verify the correct exclusion and inclusion of files based on glob patterns, ensuring the intended behavior.
- No New External Dependencies: The solution leverages Python's standard library "fnmatch" module, avoiding the introduction of any new external dependencies.


gemini-code-assist

Code Review

This pull request correctly implements glob pattern matching for file inclusion and exclusion, which is a significant improvement over the previous string matching logic. The use of fnmatch and path normalization ensures consistent, cross-platform behavior.

I've provided a few suggestions:

- Refactoring the matches_any_glob helper for better conciseness and robustness.
- Expanding the new unit tests to cover path-based glob patterns, which are mentioned in the PR description but not yet tested.
- Moving the unrelated change to the text splitter (CodeAwareSplitter) to a separate PR to keep this one focused.

Overall, this is a valuable fix. Addressing these points will make it even better.

api/data_pipeline.py

tests/unit/test_file_filters_glob.py

fix: glob mapping

bb0825c

gemini-code-assist bot reviewed Dec 30, 2025

Review threads:
- api/data_pipeline.py (resolved)
- api/data_pipeline.py (outdated, resolved)
- tests/unit/test_file_filters_glob.py (resolved)

fix: added more tests and removed code splitter

1d28ea3
