@danielfrey63

Summary

This PR makes the backend repository ingestion (read_all_documents in api/data_pipeline.py) correctly interpret included_files / excluded_files as glob patterns (e.g. *.md, packages/*/dist) in a predictable, cross-platform way.

Previously, patterns in file filter configuration could behave inconsistently because matching was effectively done as plain filename equality in some cases. That led to unexpected files being embedded/indexed, increasing processing time and cost, and reducing retrieval quality.

Problem / Motivation

api/config/repo.json contains many ignore patterns that are clearly intended to be globs (e.g. *.min.js, packages/*/dist). Without consistent glob matching:

  • Files that should be excluded (e.g. Markdown docs via *.md) may still be processed.
  • Inclusion-only mode via included_files does not behave as users expect.
  • Filtering can differ depending on OS path separators (\ vs /) and whether the code checks filenames vs relative paths.

What Changed

1) Glob-aware matching in api/data_pipeline.py

Inside read_all_documents, the helper should_process_file(...) now:

  • Normalizes paths using os.path.normpath
  • Computes:
    • file_name (basename)
    • rel_path_norm (normalized relative path)
  • Introduces glob-based matching via fnmatch.fnmatchcase(...) against:
    • the filename (README.md)
    • the normalized relative path (docs/README.md, packages/foo/dist/index.js)

This makes included_files / excluded_files behave like proper glob filters rather than requiring exact filename matches.
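
The matching described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the signatures of matches_any_glob and should_process_file, the rule that included_files takes precedence over excluded_files, and the explicit conversion to forward slashes are all assumptions made for illustration.

```python
import fnmatch
import os
from typing import Iterable, Optional


def matches_any_glob(rel_path: str, patterns: Iterable[str]) -> bool:
    """Return True if the basename or the normalized relative path matches any pattern."""
    # Assumption: force forward slashes so patterns written with "/" also match on Windows.
    rel_path_norm = os.path.normpath(rel_path).replace(os.sep, "/")
    file_name = os.path.basename(rel_path_norm)
    for pattern in patterns:
        pattern_norm = pattern.replace("\\", "/")
        # fnmatchcase is case-sensitive and does not normalize its arguments.
        if fnmatch.fnmatchcase(file_name, pattern_norm):
            return True
        if fnmatch.fnmatchcase(rel_path_norm, pattern_norm):
            return True
    return False


def should_process_file(
    rel_path: str,
    included_files: Optional[Iterable[str]] = None,
    excluded_files: Optional[Iterable[str]] = None,
) -> bool:
    """Hypothetical filter: inclusion mode wins when included_files is non-empty."""
    if included_files:
        return matches_any_glob(rel_path, included_files)
    return not matches_any_glob(rel_path, excluded_files or [])


print(should_process_file("docs/README.md", excluded_files=["*.md"]))
print(should_process_file("src/app.py", included_files=["*.md"]))
print(matches_any_glob("packages/foo/dist/index.js", ["packages/*/dist/*"]))
```

Note that fnmatch patterns must match the whole string, and * also matches path separators. With plain fnmatch, packages/*/dist therefore matches the directory path packages/foo/dist itself, while matching files below it needs a pattern such as packages/*/dist/*.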

2) New unit tests

Added tests/unit/test_file_filters_glob.py to verify the intended behavior:

  • Exclusion: excluded_files=["*.md"] excludes README.md while keeping other files.
  • Inclusion: included_files=["*.md"] includes only Markdown files.

Tests run in a temporary directory to stay deterministic and independent of repo contents.
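
For readers who want to reproduce the idea without the repository, here is a self-contained sketch of such tests. It does not call read_all_documents, whose signature is not shown in this PR description. Instead, collect_files is a hypothetical stand-in that walks a temporary directory and applies the same kind of filename and relative-path glob check.

```python
import fnmatch
import os
import tempfile


def collect_files(root, included_files=None, excluded_files=None):
    """Hypothetical stand-in for the ingestion walk: sorted relative paths that pass the filters."""

    def matches(rel_path, patterns):
        name = os.path.basename(rel_path)
        return any(
            fnmatch.fnmatchcase(name, p) or fnmatch.fnmatchcase(rel_path, p)
            for p in patterns
        )

    kept = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            if included_files:
                keep = matches(rel_path, included_files)
            else:
                keep = not matches(rel_path, excluded_files or [])
            if keep:
                kept.append(rel_path)
    return sorted(kept)


def make_sample_repo(root):
    """Create a tiny, deterministic file tree inside root."""
    for rel_path in ("README.md", "main.py", "docs/guide.md"):
        full_path = os.path.join(root, *rel_path.split("/"))
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "w", encoding="utf-8") as handle:
            handle.write("sample\n")


def test_excluded_md_is_skipped():
    with tempfile.TemporaryDirectory() as root:
        make_sample_repo(root)
        assert collect_files(root, excluded_files=["*.md"]) == ["main.py"]


def test_included_md_only():
    with tempfile.TemporaryDirectory() as root:
        make_sample_repo(root)
        assert collect_files(root, included_files=["*.md"]) == ["README.md", "docs/guide.md"]
```

Both functions follow the pytest naming convention, so they can be collected by pytest or called directly.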

Why This Approach

  • No new dependencies: Uses Python stdlib fnmatch.
  • Simple + readable: Keeps logic local to read_all_documents and avoids over-engineering.
  • Cross-platform consistency: Path normalization prevents Windows separator issues.
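
One detail worth checking when relying on normalization: on Windows, os.path.normpath returns backslash-separated paths, so a pattern written with forward slashes only matches if the separators are unified as well. The sketch below simulates both platforms with the stdlib ntpath and posixpath modules. The helper to_posix_rel_path is a hypothetical name, and whether the PR performs this extra conversion is not stated in the description.

```python
import fnmatch
import ntpath
import posixpath


def to_posix_rel_path(rel_path, path_module=posixpath):
    """Normalize a relative path and return it with forward slashes (hypothetical helper)."""
    normalized = path_module.normpath(rel_path)
    return normalized.replace(path_module.sep, "/")


# ntpath behaves like os.path on Windows, regardless of the host OS.
windows_style = ntpath.normpath("docs/sub/../README.md")
print(windows_style)                                     # docs\README.md
print(fnmatch.fnmatchcase(windows_style, "docs/*.md"))   # False

unified = to_posix_rel_path("docs\\sub\\..\\README.md", ntpath)
print(unified)                                           # docs/README.md
print(fnmatch.fnmatchcase(unified, "docs/*.md"))         # True
```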

Behavior / Compatibility Notes

  • No API changes.
  • Existing configuration files benefit immediately.
  • Filtering is now consistent for patterns matching either:
    • filenames, or
    • relative paths (recommended for nested patterns like packages/*/dist).

Testing

  • Added unit tests: tests/unit/test_file_filters_glob.py

Suggested Manual Smoke Test (Optional)

  • Run ingestion on a sample repo with:
    • excluded_files=["*.md"] and confirm Markdown files are skipped
    • included_files=["*.md"] and confirm only Markdown files are processed

@gemini-code-assist

Summary of Changes

Hello @danielfrey63, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the file filtering mechanism within the data ingestion pipeline to accurately process glob patterns. By integrating robust glob matching, it resolves previous issues where file inclusion and exclusion rules were inconsistently applied, leading to more precise data processing, reduced overhead, and improved retrieval quality.

Highlights

  • Consistent Glob Pattern Matching: The backend repository ingestion now correctly interprets "included_files" and "excluded_files" as glob patterns (e.g., "*.md", "packages/*/dist") using "fnmatch.fnmatchcase" for predictable, cross-platform behavior.
  • Improved File Filtering Logic: The "should_process_file" helper function in "api/data_pipeline.py" has been updated to normalize paths and apply glob matching against both the filename and the normalized relative path, resolving previous inconsistencies.
  • New Unit Tests: Dedicated unit tests have been added in "tests/unit/test_file_filters_glob.py" to verify the correct exclusion and inclusion of files based on glob patterns, ensuring the intended behavior.
  • No New External Dependencies: The solution leverages Python's standard library "fnmatch" module, avoiding the introduction of any new external dependencies.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly implements glob pattern matching for file inclusion and exclusion, which is a significant improvement over the previous string matching logic. The use of fnmatch and path normalization ensures consistent, cross-platform behavior.

I've provided a few suggestions:

  • Refactoring the matches_any_glob helper for better conciseness and robustness.
  • Expanding the new unit tests to cover path-based glob patterns, which are mentioned in the PR description but not yet tested.
  • Moving the unrelated change to the text splitter (CodeAwareSplitter) to a separate PR to keep this one focused.

Overall, this is a valuable fix. Addressing these points will make it even better.
