-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix: glob mapping #430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: glob mapping #430
Conversation
Summary of ChangesHello @danielfrey63, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the file filtering mechanism within the data ingestion pipeline to accurately process glob patterns. By integrating robust glob matching, it resolves previous issues where file inclusion and exclusion rules were inconsistently applied, leading to more precise data processing, reduced overhead, and improved retrieval quality. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly implements glob pattern matching for file inclusion and exclusion, which is a significant improvement over the previous string matching logic. The use of fnmatch and path normalization ensures consistent, cross-platform behavior.
I've provided a few suggestions:
- Refactoring the
matches_any_globhelper for better conciseness and robustness. - Expanding the new unit tests to cover path-based glob patterns, which are mentioned in the PR description but not yet tested.
- Moving the unrelated change to the text splitter (
CodeAwareSplitter) to a separate PR to keep this one focused.
Overall, this is a valuable fix. Addressing these points will make it even better.
Summary
This PR makes the backend repository ingestion (api/data_pipeline.py → read_all_documents) correctly interpret
included_files/excluded_filesas glob patterns (e.g.*.md,packages/*/dist) in a predictable, cross-platform way.Previously, patterns in file filter configuration could behave inconsistently because matching was effectively done as plain filename equality in some cases. That led to unexpected files being embedded/indexed, increasing processing time and cost, and reducing retrieval quality.
Problem / Motivation
api/config/repo.json contains many ignore patterns that are clearly intended to be globs (e.g.
*.min.js,packages/*/dist, etc.). Without consistent glob matching:*.md) may still be processed.included_filesdoes not behave as users expect.\vs/) and whether the code checks filenames vs relative paths.What Changed
1) Glob-aware matching in api/data_pipeline.py
Inside read_all_documents, the helper should_process_file(...) now:
os.path.normpathfile_name(basename)rel_path_norm(normalized relative path)fnmatch.fnmatchcase(...)against:docs/README.md,packages/foo/dist/index.js)This makes
included_files/excluded_filesbehave like proper glob filters rather than requiring exact filename matches.2) New unit tests
Added
tests/unit/test_file_filters_glob.pyto verify the intended behavior:excluded_files=["*.md"]excludesREADME.mdwhile keeping other files.included_files=["*.md"]includes only Markdown files.Tests run in a temporary directory to stay deterministic and independent of repo contents.
Why This Approach
fnmatch.Behavior / Compatibility Notes
packages/*/dist).Testing
tests/unit/test_file_filters_glob.pySuggested Manual Smoke Test (Optional)
excluded_files=["*.md"]and confirm Markdown files are skippedincluded_files=["*.md"]and confirm only Markdown files are processed