-
Notifications
You must be signed in to change notification settings - Fork 25.6k
ES|QL Update CHUNK to support chunking_settings as optional argument #138123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ES|QL Update CHUNK to support chunking_settings as optional argument #138123
Conversation
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far LGTM - I think the options validation is correct (missing some tests) but the direction is good!
As a drive-by comment - I find a bit confusing having to add nested options when we have a single other param (num_chunks):
CHUNK(content, {"num_chunks": 3, "chunking_settings": { "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 } })
}
Wouldn't it be better to avoid having to specify chunking_settings as an additional option, and just flatten the chunking settings into the overall options?
CHUNK(content, {"num_chunks": 3, "strategy": "sentence", "max_chunk_size": 50, "sentence_overlap": 0 })
}
We can do the same chunking building if we remove the num_chunks from the options map and then try to build from there 🤷
|
|
||
| @Evaluator(extraName = "BytesRef") | ||
| static void process(BytesRefBlock.Builder builder, BytesRef str, int numChunks, int chunkSize) { | ||
| static void process(BytesRefBlock.Builder builder, BytesRef str, int numChunks, @Fixed ChunkingSettings chunkingSettings) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add @Fixed to numChunks as well to avoid it being an IntVector in the generated evaluator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to check the validation for the ChunkingSettings (that an option is the type it should be etc - I guess you'll be including that after initial validation 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was thinking that once we have a map, additional validation on valid chunking settings can be handled by the chunking settings builder. We would still throw an error if invalid chunking settings were passed in but this is more robust to changes to our chunking settings - for example adding new strategies or options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, makes sense! Then it will be enough to do a test with an incorrect setting and ensure we catch it 👍 . We can do that as part of the VerifierTests I believe.
@carlosdelest So I get that, but here's the rub - various chunking settings strategies have different options. For example, I realize the nested syntax is confusing but I'm just not sure if there's a better way to do that. We can optimize for |
@kderusso I'm not sure that I understand. Can't we take this option map:
and just send it to the |
I suppose that's an option, to send in all other options as a big bag'o'options, but I'm not thrilled with that for future API extensibility - if we ever add additional options like |
Updates
CHUNKto supportchunking_settingsas an optional argument. Replacesnum_chunks.Usage: