feat: add code chunking functionality #398
base: main
❌ DCO Check Failed

Hi @bridgetmcg, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for Bridget McGinn <bridget.mcginn@ibm.com>
    I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a4edf0c9a8511a0e3cd31acda2df335d62
    I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 46bb88ae3a00ebbdd33336e51b51e030c748425b"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.

🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Related Documentation

Checked 2 published document(s). No updates required. You have 5 draft document(s). Publish docs to keep them always up-to-date.
I, Bridget McGinn <bridget.mcginn@ibm.com>, hereby add my Signed-off-by to this commit: 334811a
Signed-off-by: Bridget McGinn <bridget.mcginn@ibm.com>
5c6fc1e to 32b120d
Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:
I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:
- to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
- to ensure composability with existing chunkers: instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the `HierarchicalChunker` (a good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) it is itself composable into the `HybridChunker`). Let us still define the interface specifics, but it could look like an optional kwarg in `HierarchicalChunker.chunk()`, e.g. `code_chunking_strategy`, adhering to a matching interface.
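To make the suggestion concrete, here is a minimal sketch of what such a pluggable strategy could look like. Everything in it is an assumption for illustration: `Item`, `CodeChunkingStrategy`, `BlankLineStrategy`, and `SketchHierarchicalChunker` are hypothetical stand-ins, not docling-core classes, and only the kwarg name `code_chunking_strategy` comes from the comment above. The actual interface is still to be defined.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol


@dataclass
class Item:
    """Hypothetical stand-in for a document item (not the docling-core type)."""
    text: str
    is_code: bool = False


class CodeChunkingStrategy(Protocol):
    """Hypothetical interface that any code chunking strategy would match."""

    def chunk_code(self, code: str) -> Iterator[str]: ...


class BlankLineStrategy:
    """Toy strategy: split code into blocks separated by blank lines."""

    def chunk_code(self, code: str) -> Iterator[str]:
        for block in code.split("\n\n"):
            if block.strip():
                yield block.strip()


class SketchHierarchicalChunker:
    """Illustrative only; does not mirror the real HierarchicalChunker."""

    def chunk(
        self,
        items: list[Item],
        code_chunking_strategy: Optional[CodeChunkingStrategy] = None,
    ) -> Iterator[str]:
        for item in items:
            # Delegate code items to the strategy when one is supplied;
            # otherwise fall back to one chunk per item.
            if item.is_code and code_chunking_strategy is not None:
                yield from code_chunking_strategy.chunk_code(item.text)
            else:
                yield item.text


items = [Item("Intro paragraph."), Item("a = 1\n\nb = 2", is_code=True)]
default_chunks = list(SketchHierarchicalChunker().chunk(items))
delegated_chunks = list(
    SketchHierarchicalChunker().chunk(
        items, code_chunking_strategy=BlankLineStrategy()
    )
)
```

Because the strategy is an optional argument, callers that do not pass it keep the existing one-chunk-per-item behaviour, which is the composability property the comment asks for.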
A more minor comment is regarding `CodeDocMeta`, which I see inherits from `BaseMeta`. Some application code may expect to interact with `DocMeta` (also a `BaseMeta` child), so if possible we should extend that instead.
This point goes hand in hand with the interface specifics TBD above.
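A small sketch of why the parent class matters, using plain dataclasses as stand-ins. The three class names come from the comment above, but their fields (`origin`, `language`) and the `describe` helper are hypothetical and do not reflect the real docling-core definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseMeta:
    """Stand-in for the common metadata base class."""


@dataclass
class DocMeta(BaseMeta):
    """Stand-in for the metadata type application code may expect."""
    origin: Optional[str] = None  # hypothetical field


@dataclass
class CodeDocMeta(DocMeta):
    """Extends DocMeta rather than BaseMeta, as the comment suggests."""
    language: Optional[str] = None  # hypothetical field


def describe(meta: BaseMeta) -> str:
    # Application code written against DocMeta keeps working for code chunks
    # only if CodeDocMeta is a DocMeta subclass.
    if isinstance(meta, DocMeta):
        return f"doc chunk from {meta.origin}"
    return "unknown chunk"


meta = CodeDocMeta(origin="example.py", language="python")
description = describe(meta)
```

If `CodeDocMeta` inherited directly from `BaseMeta` in this sketch, `describe` would fall through to the `"unknown chunk"` branch.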
(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)
Hi @vagenas, I added some logic to address your suggestions.
This PR introduces code chunking functionality to `docling-core`, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.
Features
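As a rough illustration of parse-based chunking with language detection, the sketch below splits Python source into one chunk per top-level function or class. It deliberately swaps in Python's standard-library `ast` module for tree-sitter so that it runs without extra dependencies; it is not the PR's implementation. The extension table and the function names `detect_language` and `chunk_python` are assumptions.

```python
import ast
from pathlib import Path
from typing import Optional

# Hypothetical extension table; the PR's actual detection logic may differ.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".js": "javascript",
    ".java": "java",
    ".c": "c",
}


def detect_language(filename: str) -> Optional[str]:
    """Guess the language from the file extension, or return None."""
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())


def chunk_python(source: str) -> list[str]:
    """Return the source text of each top-level function and class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


source = "import os\n\ndef f():\n    return 1\n\nclass A:\n    x = 2\n"
language = detect_language("example.py")
# Only Python is handled in this sketch; other languages pass through whole.
chunks = chunk_python(source) if language == "python" else [source]
```

Chunking on syntax-tree nodes rather than on fixed line counts keeps each function or class intact, which is the main motivation for using a parser here.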
Core
Language Support
Testing