Discussion on `EmbeddingBasedDocumentSplitter` component #356

davidsbatista · 2025-08-13T11:52:41Z

davidsbatista
Aug 13, 2025
Maintainer

This is the discussion board for the EmbeddingBasedDocumentSplitter component.

We introduced the EmbeddingBasedDocumentSplitter component, which splits a large document into smaller, semantically meaningful documents by comparing embeddings and splitting where the semantics of the text diverge.

Usage example

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_experimental.components.preprocessors import EmbeddingBasedDocumentSplitter

# A short document whose last two sentences switch to a different topic.
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# The embedder produces the embeddings that the splitter compares.
embedder = SentenceTransformersDocumentEmbedder()
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,
    percentile=0.95,
    min_length=50,
    max_length=1000,
)

# warm_up() must be called before run().
splitter.warm_up()
result = splitter.run(documents=[doc])

Try it out and let us know what you think in the comments 👇

Note: Experimental features live in this repository for a fixed period of time. We don't guarantee that we will continue maintaining experimental features. But if a feature is successful and stable, and if you like it, we will move it to the core Haystack package.

d-kleine · 2025-09-22T15:16:36Z

d-kleine
Sep 22, 2025

Hi @davidsbatista,

I just tested it, and I would like to share some feedback:

- One small observation: I freshly installed haystack (current release) along with haystack-experimental (from main) in a new env, but after that I needed to run `pip install "nltk>=3.9.1" "sentence-transformers>=4.1.0"` to run the code. The message said something about these packages being optional for haystack, just FYI.
- Could you please help me understand the motivation behind the new EmbeddingBasedDocumentSplitter component? What problems does it aim to solve, and what is it particularly good for?

0 replies

davidsbatista · 2025-09-23T12:00:02Z

davidsbatista
Sep 23, 2025
Maintainer Author

Hi @d-kleine! Thanks for your feedback! Appreciated! :)

> One small observation: I freshly installed haystack (current release) along with haystack-experimental (from main) in a new env, but after that I needed to run `pip install "nltk>=3.9.1" "sentence-transformers>=4.1.0"` to run the code. The message said something about these packages being optional for haystack, just FYI.

That's how it should work, although when we move it to the main Haystack package I should add the LazyImport check so that the error message is cleaner. I noticed that it's not being used in EmbeddingBasedDocumentSplitter.
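For readers unfamiliar with the idea, here is a minimal sketch of an optional-dependency guard. It is not Haystack's LazyImport; `OptionalImport`, its `check()` method, and the package name `not_a_real_package_for_demo` are hypothetical names used only for illustration, and the sketch uses the standard library only. The point is that the import failure is remembered quietly and turned into a readable install hint when the dependency is actually needed.

```python
import importlib


class OptionalImport:
    """Hypothetical stand-in for an optional-dependency guard."""

    def __init__(self, module_name, install_hint):
        self.module_name = module_name
        self.install_hint = install_hint
        self.module = None
        self.error = None
        try:
            # Try the import eagerly, but do not fail yet if it is missing.
            self.module = importlib.import_module(module_name)
        except ImportError as exc:
            self.error = exc

    def check(self):
        """Return the module, or raise an ImportError with a clear hint."""
        if self.error is not None:
            raise ImportError(
                f"Could not import '{self.module_name}'. {self.install_hint}"
            ) from self.error
        return self.module


# An installed module passes the check and is returned.
json_module = OptionalImport("json", "It ships with Python.").check()

# A missing module only fails when check() is called, with a readable message.
missing = OptionalImport(
    "not_a_real_package_for_demo",
    "Run 'pip install not_a_real_package_for_demo'.",
)
try:
    missing.check()
    message = ""
except ImportError as exc:
    message = str(exc)
    print(message)
```

A component would call `check()` in its constructor or in `warm_up()`, so that users see the install hint instead of a bare traceback.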

> the motivation behind the new EmbeddingBasedDocumentSplitter component?

It provides another way to split a large document into semantically meaningful, smaller documents. Instead of splitting by sections, paragraphs, or sentences, it tries to make a split where the semantics of the text diverge, based on comparing embeddings.
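To make the idea concrete, here is a rough sketch of percentile-based splitting on embedding distances. It is an illustration under assumptions, not the component's actual implementation: the two-dimensional vectors are hand-made toy embeddings rather than model output, the helper names (`cosine_distance`, `percentile_value`, `find_split_points`) are made up for this example, and the `min_length`/`max_length` handling is left out.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile_value(values, q):
    """Linearly interpolated percentile of values, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def find_split_points(embeddings, percentile=0.95):
    """Indices i where a new chunk starts at group i."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return []
    threshold = percentile_value(distances, percentile)
    # Split only where the distance is strictly above the percentile threshold.
    return [i + 1 for i, d in enumerate(distances) if d > threshold]


# Toy sentence groups and hand-made embeddings (assumed, not model output).
groups = [
    "This is a first sentence.",
    "This is a second sentence.",
    "Completely different topic.",
    "The same completely different topic.",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

split_points = find_split_points(embeddings, percentile=0.95)

# Join the groups between consecutive split points into chunks.
chunks = []
start = 0
for end in split_points + [len(groups)]:
    chunks.append(" ".join(groups[start:end]))
    start = end

for chunk in chunks:
    print(chunk)
```

In this toy input the only large distance is between the second and third group, so the text ends up in two chunks, one per topic.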

1 reply

d-kleine Sep 23, 2025

> It provides another way to split your large document into semantically meaningful, smaller documents. Instead of splitting by sections, paragraphs, or sentences one tries to make a split where the semantics of the text diverge based on comparing embeddings.

I see, that was my guess too 🙂 I read the docstring and was a bit unsure, maybe it would be useful to add its "aim" there too.

So, I would like to share some of my thoughts:

- Overall, it is easy to understand, use, and handle.
- The params are also good, but I could not understand the aim of `use_split_rules` and `extend_abbreviations`.
- I think it is a great addition to the other doc splitters (DocumentSplitter, etc.).
- Regarding the example code, I think it would be useful for the docs to add a brief explanation of the output to make the functionality/usage clearer.
- Currently, the documents are split to a fixed length. Maybe it would be useful to add another setting, splitting the documents into meaningful chunks to control the number of generated documents after the split? (I’m happy to go into more detail if you want)

davidsbatista · 2025-09-24T08:09:19Z

davidsbatista
Sep 24, 2025
Maintainer Author

Hi again @d-kleine, and thank you for all the good feedback. I've opened a PR to improve the docstrings and the code example. Thanks for pointing that out; I hope this makes it easier for users seeing this component for the first time.

Regarding the last comment:

> Currently, the documents are split to a fixed length. Maybe it would be useful to add another setting, splitting the documents into meaningful chunks to control the number of generated documents after the split? (I’m happy to go into more detail if you want)

Can you detail a bit more your suggestion, maybe with an example?

3 replies

d-kleine Sep 24, 2025

> Can you detail a bit more your suggestion, maybe with an example?

Currently, a document will be split up into an undefined number of semantically meaningful, smaller documents. But what if you'd like to have a certain number of smaller documents? For example, I was thinking of a transcription of an interview where you know the topics that have been discussed, usually in a specific order, but that has not been labeled yet. Let's say there were three topics discussed in the transcription: politics, sports, culture. It would be great to split up the transcription into 3 meaningful parts.

Just an outside-the-box idea, idk if this might be an edge case or out-of-scope 😅
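A rough sketch of what I mean, purely hypothetical and not an existing option of the component: if you want exactly k parts, cut at the k-1 largest distances between consecutive sentence groups. The function name `split_into_k_parts`, the group labels, and the three-dimensional toy embeddings are all made up for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_into_k_parts(groups, embeddings, k):
    """Split groups into exactly k contiguous parts at the k-1 largest jumps."""
    if not 1 <= k <= len(groups):
        raise ValueError("k must be between 1 and the number of groups")
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    # Positions of the k-1 largest distances, restored to document order.
    largest = sorted(
        range(len(distances)), key=lambda i: distances[i], reverse=True
    )[: k - 1]
    boundaries = sorted(i + 1 for i in largest)

    parts = []
    start = 0
    for end in boundaries + [len(groups)]:
        parts.append(groups[start:end])
        start = end
    return parts


# Toy interview transcript: two politics groups, two sports groups, one culture group.
groups = ["politics A", "politics B", "sports A", "sports B", "culture A"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]

parts = split_into_k_parts(groups, embeddings, k=3)
for part in parts:
    print(part)
```

This keeps the original order of the text, which fits the interview case where topics are discussed one after another.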

davidsbatista Sep 24, 2025
Maintainer Author

Ah, I see. We never had such a use case. Interesting; somehow clustering comes to mind. Anyway, I will keep that in mind when we work on the document splitters again. If you feel you have a specific use case, let us know about it and open an issue.

d-kleine Sep 24, 2025

Yeah, I just wanted to mention what came to mind while trying it out, but it might be a use case for another document splitter. Anyway, great feature! 👍🏻🙂
