Skip to content

Conversation

@kderusso
Copy link
Member

@kderusso kderusso commented Nov 14, 2025

Updates CHUNK to support chunking_settings as an optional argument. Replaces num_chunks.

Usage:

CHUNK(content, {"num_chunks": 3, "chunking_settings": { "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 } })
}

@github-actions
Copy link
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

Copy link
Member

@carlosdelest carlosdelest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far LGTM - I think the options validation is correct (missing some tests) but the direction is good!

As a drive-by comment - I find a bit confusing having to add nested options when we have a single other param (num_chunks):

CHUNK(content, {"num_chunks": 3, "chunking_settings": { "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 } })
}

Wouldn't it be better to avoid having to specify chunking_settings as an additional option, and just flatten the chunking settings into the overall options?

CHUNK(content, {"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 })
}

We can do the same chunking building if we remove the num_chunks from the options map and then try to build from there 🤷


@Evaluator(extraName = "BytesRef")
static void process(BytesRefBlock.Builder builder, BytesRef str, int numChunks, int chunkSize) {
static void process(BytesRefBlock.Builder builder, BytesRef str, int numChunks, @Fixed ChunkingSettings chunkingSettings) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add @Fixed to numChunks as well to avoid it being an IntVector in the generated evaluator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check the validation for the ChunkingSettings (that an option is the type it should be etc - I guess you'll be including that after initial validation 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I was thinking that once we have a map, additional validation on valid chunking settings can be handled by the chunking settings builder. We would still throw an error if invalid chunking settings were passed in but this is more robust to changes to our chunking settings - for example adding new strategies or options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, makes sense! Then it will be enough to do a test with an incorrect setting and ensure we catch it 👍 . We can do that as part of the VerifierTests I believe.

@kderusso
Copy link
Member Author

Wouldn't it be better to avoid having to specify chunking_settings as an additional option, and just flatten the chunking settings into the overall options?

@carlosdelest So I get that, but here's the rub - various chunking settings strategies have different options. For example, sentence based chunking requires sentence_overlap but word based chunking requires overlap. And none would have no overlap. So if we were to flatten out these options, we'd have to take every possible permutation into account and commit to updating it going forward. The nested options felt better to me, because we can defer to the chunking settings builder to validate the options. This is the strategy we went with for semantic_text fields and it works well, because when new chunking settings are added they seamlessly work.

I realize the nested syntax is confusing but I'm just not sure if there's a better way to do that.

We can optimize for semantic_text fields and have reasonable default chunking strategies, and we've discussed whether we want to use an inference ID to pull chunking settings (this has some complications because of jarhell issues with inference and esql). But this felt like the first step, because it offers that flexibility. WDYT?

@carlosdelest
Copy link
Member

various chunking settings strategies have different options. For example, sentence based chunking requires sentence_overlap but word based chunking requires overlap. And none would have no overlap. So if we were to flatten out these options, we'd have to take every possible permutation into account and commit to updating it going forward.

@kderusso I'm not sure that I understand. Can't we take this option map:

{"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0}

and just send it to the ChunkingSettingsBuilder, removing the num_chunks from it?

@kderusso
Copy link
Member Author

@kderusso I'm not sure that I understand. Can't we take this option map:

{"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0}

and just send it to the ChunkingSettingsBuilder, removing the num_chunks from it?

I suppose that's an option, to send in all other options as a big bag'o'options, but I'm not thrilled with that for future API extensibility - if we ever add additional options like super_excellent_widget then it really complicates everything? Maybe I'm overthinking it, it's a good discussion - would like to hear from @ioanatia on this too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants