From the flowchart, it looks like documents are split directly into chunks, and the chunk embeddings are then matched for similarity against the vectorized query. I would like to know whether this process ignores which passages belong to the same document, or whether everything is simply chopped up and mixed together.

Replies: 1 comment
🤖 Hey @Shaky-00! Great to see your question again 😊🚀 When Langchain-Chatchat processes long texts, it splits them and mixes the pieces together for similarity matching. You can see this in the code. Here is the relevant snippet:

```python
def metaphor_search(
    text: str,
    result_len: int = SEARCH_ENGINE_TOP_K,
    split_result: bool = False,
    chunk_size: int = 500,
    chunk_overlap: int = OVERLAP_SIZE,
) -> List[Dict]:
    ...
    # metaphor returns long texts, which have to be split before retrieval
    if split_result:
        docs = [Document(page_content=x.extract,
                         metadata={"link": x.url, "title": x.title})
                for x in contents]
        text_splitter = RecursiveCharacterTextSplitter(["\n\n", "\n", ".", " "],
                                                       chunk_size=chunk_size,
                                                       chunk_overlap=chunk_overlap)
        splitted_docs = text_splitter.split_documents(docs)
        # put the split documents into a temporary store and re-select the TOP_K documents
        if len(splitted_docs) > result_len:
            normal = NormalizedLevenshtein()
            for x in splitted_docs:
                x.metadata["score"] = normal.similarity(text, x.page_content)
            splitted_docs.sort(key=lambda x: x.metadata["score"], reverse=True)
            splitted_docs = splitted_docs[:result_len]
        docs = [{"snippet": x.page_content,
                 "link": x.metadata["link"],
                 "title": x.metadata["title"]}
                for x in splitted_docs]
    ...
```

This code comes from server/chat/search_engine_chat.py. So, to answer your question: when Langchain-Chatchat handles long texts here, it does split them into chunks and mix the chunks together for similarity matching, without considering which chunks belong to the same document.
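For illustration, here is a minimal, self-contained sketch of the same chunk-and-rerank idea. It uses the same RecursiveCharacterTextSplitter and NormalizedLevenshtein classes that the snippet above relies on, but the input results and the query are made-up stand-ins for what the Metaphor API would return:

```python
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

# Made-up search results standing in for Metaphor's long-text extracts.
fake_results = [
    {"url": "https://example.com/a", "title": "Doc A",
     "extract": "Chunking splits a long page into overlapping pieces. " * 40},
    {"url": "https://example.com/b", "title": "Doc B",
     "extract": "Similarity matching scores each piece against the query. " * 40},
]
query = "how are chunks scored against the query"

docs = [Document(page_content=r["extract"],
                 metadata={"link": r["url"], "title": r["title"]})
        for r in fake_results]

# Step 1: split every result into overlapping chunks; chunks from all
# documents land in one flat list, with no grouping by source document.
splitter = RecursiveCharacterTextSplitter(["\n\n", "\n", ".", " "],
                                          chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Step 2: score each chunk against the query with normalized Levenshtein
# similarity and keep only the top 3, regardless of which page it came from.
scorer = NormalizedLevenshtein()
for chunk in chunks:
    chunk.metadata["score"] = scorer.similarity(query, chunk.page_content)
chunks.sort(key=lambda c: c.metadata["score"], reverse=True)

for chunk in chunks[:3]:
    print(f'{chunk.metadata["title"]} ({chunk.metadata["score"]:.3f}): '
          f'{chunk.page_content[:60]}...')
```

Note that even though the chunks are matched as one mixed pool, each chunk's metadata still carries the link and title of the page it came from, so a ranked result can always be traced back to its source document.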