bridgetmcg
This PR introduces code chunking functionality to docling-core, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.

Features

Core

  • CodeChunker - Base abstract class for code chunking with Tree-sitter integration
  • Language-specific chunkers - Specialized implementations for 5 major programming languages
  • Smart chunk splitting - Automatic splitting of large functions while preserving context
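A minimal sketch of the "smart chunk splitting" idea, under stated assumptions: it uses Python's standard-library `ast` module in place of Tree-sitter so that it runs without extra dependencies, and the function name `split_large_functions` and the `max_lines` parameter are hypothetical, not taken from the PR. Oversized functions are cut into pieces that each repeat the signature line as context.

```python
import ast


def split_large_functions(source: str, max_lines: int = 20) -> list:
    """Return one chunk per top-level function.

    Functions longer than max_lines are split into pieces, and every piece
    repeats the first line of the function (its signature) as context.
    Simplifications: only Python is handled, multi-line signatures are not
    kept whole, and splits fall on line boundaries, not statement boundaries.
    """
    if max_lines < 2:
        raise ValueError("max_lines must leave room for the header and one body line")
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # ast line numbers are 1-based and end_lineno is inclusive.
        func_lines = lines[node.lineno - 1 : node.end_lineno]
        if len(func_lines) <= max_lines:
            chunks.append("\n".join(func_lines))
            continue
        header, body = func_lines[0], func_lines[1:]
        step = max_lines - 1  # one line of each piece is reserved for the header
        for start in range(0, len(body), step):
            chunks.append("\n".join([header] + body[start : start + step]))
    return chunks
```

A Tree-sitter based version would follow the same shape, with the node's start and end points taking the place of `lineno` and `end_lineno`.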

Language Support

  • Python (PythonFunctionChunker) - Functions, classes, imports, module variables
  • TypeScript (TypeScriptFunctionChunker) - Functions, classes, interfaces, imports
  • JavaScript (JavaScriptFunctionChunker) - Inherits from TypeScript chunker
  • Java (JavaFunctionChunker) - Methods, constructors, classes, enums, interfaces
  • C (CFunctionChunker) - Functions, structs, macros, preprocessor definitions
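One way to picture how a file could be routed to the right chunker is a lookup by file extension. The sketch below is illustrative only: the class names come from the list above, but the extension mapping and the helper `chunker_name_for` are assumptions, and the PR may select chunkers differently.

```python
from pathlib import Path
from typing import Optional

# Class names as listed in the PR description. The extension mapping is an
# assumption made for illustration, not something the PR specifies.
CHUNKER_BY_EXTENSION = {
    ".py": "PythonFunctionChunker",
    ".ts": "TypeScriptFunctionChunker",
    ".js": "JavaScriptFunctionChunker",
    ".java": "JavaFunctionChunker",
    ".c": "CFunctionChunker",
    ".h": "CFunctionChunker",
}


def chunker_name_for(path: str) -> Optional[str]:
    """Return the chunker class name for a file path, or None if unsupported."""
    return CHUNKER_BY_EXTENSION.get(Path(path).suffix.lower())
```

For example, `chunker_name_for("src/Main.java")` returns `"JavaFunctionChunker"`, while an unsupported file such as `README.md` yields `None`.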

Testing

  • test_code_chunker.py - multi-language, real code samples

Contributor

github-actions bot commented Oct 3, 2025

DCO Check Failed

Hi @bridgetmcg, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for Bridget McGinn <bridget.mcginn@ibm.com>

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a4edf0c9a8511a0e3cd31acda2df335d62
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 46bb88ae3a00ebbdd33336e51b51e030c748425b"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

mergify bot commented Oct 3, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
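The title rule can be checked locally before pushing. This is a small sketch that assumes the `~=` operator behaves like an unanchored Python regex search; the helper name `is_conventional_title` is hypothetical.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the PR title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None
```

Under this pattern, "feat: add code chunking functionality" passes, while the original title "Add Code Chunking Functionality" does not.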

dosubot bot commented Oct 3, 2025

Related Documentation

Checked 2 published document(s). No updates required.

You have 5 draft document(s). Publish docs to keep them always up-to-date

I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a

Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>

bridgetmcg changed the title from "Add Code Chunking Functionality" to "feat: add code chunking functionality" on Oct 3, 2025
Collaborator

vagenas left a comment

Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:

I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:

  • to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
  • to ensure composability with existing chunkers: instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the HierarchicalChunker. That is a good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) it is itself composable into the HybridChunker. We still need to define the interface specifics, but it could look like an optional kwarg in HierarchicalChunker.chunk(), e.g. code_chunking_strategy, adhering to a matching interface.
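To make the suggestion concrete, here is a rough sketch of what a pluggable strategy could look like. Everything in it is an assumption for discussion: `CodeChunkingStrategy`, `chunk_code`, the `Item` stand-in, and the free function `chunk` are hypothetical and do not reflect the actual HierarchicalChunker signature or docling-core types.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Protocol


class CodeChunkingStrategy(Protocol):
    """Hypothetical interface a code chunking strategy would adhere to."""

    def chunk_code(self, code: str, language: Optional[str] = None) -> Iterator[str]:
        ...


@dataclass
class Item:
    """Stand-in for a document item; not the docling-core type."""

    text: str
    is_code: bool = False


class BlankLineCodeStrategy:
    """Toy strategy: split code on blank lines (a real one would parse it)."""

    def chunk_code(self, code: str, language: Optional[str] = None) -> Iterator[str]:
        for block in code.split("\n\n"):
            if block.strip():
                yield block.strip("\n")


def chunk(
    items: Iterable[Item],
    code_chunking_strategy: Optional[CodeChunkingStrategy] = None,
) -> Iterator[str]:
    """Simplified stand-in for a hierarchical chunker that follows item boundaries."""
    for item in items:
        if item.is_code and code_chunking_strategy is not None:
            # Code items are delegated to the plugged-in strategy.
            yield from code_chunking_strategy.chunk_code(item.text)
        else:
            # Without a strategy, behaviour is unchanged: one chunk per item.
            yield item.text
```

Keeping the kwarg optional means existing callers see no change, and language detection can live inside the strategy rather than in the caller.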

A more minor comment concerns CodeDocMeta, which I see inherits from BaseMeta. Some application code may expect to interact with DocMeta (also a BaseMeta child), so if possible we should extend that instead.
This point goes hand in hand with the interface specifics TBD above.
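The compatibility concern can be illustrated with plain dataclasses. The classes below are simplified stand-ins that only borrow the names from this thread; the fields (`headings`, `language`, `symbol`) and the helper `first_heading` are assumptions, and the real docling-core models differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BaseMeta:
    """Stand-in base class; not the docling-core model."""


@dataclass
class DocMeta(BaseMeta):
    """Stand-in for the metadata type that application code may expect."""

    headings: List[str] = field(default_factory=list)


@dataclass
class CodeDocMeta(DocMeta):
    """Code metadata that extends DocMeta rather than BaseMeta directly."""

    language: Optional[str] = None
    symbol: Optional[str] = None


def first_heading(meta: BaseMeta) -> Optional[str]:
    """Example of application code written against DocMeta."""
    if isinstance(meta, DocMeta) and meta.headings:
        return meta.headings[0]
    return None
```

Because `CodeDocMeta` is a `DocMeta` here, code such as `first_heading` keeps working for code chunks; if it inherited from `BaseMeta` directly, the `isinstance` check would fail and the heading would be lost.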

(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)

Author

bridgetmcg commented Oct 10, 2025

Hi @vagenas, I added some logic to address your suggestions.

2 participants