From the flowchart, it looks like documents are split directly into chunks, and the chunk embeddings are then matched for similarity against the vectorized query. I would like to know whether this process ignores which passages belong to the same document, or whether everything is simply chopped up and mixed together.

Replies: 1 comment
🤖 Hey @Shaky-00! Great to see your question again 😊🚀 When Langchain-Chatchat processes long texts, it splits them and mixes the pieces together for similarity matching. You can see this in the code. Here is the relevant snippet:

```python
def metaphor_search(
    text: str,
    result_len: int = SEARCH_ENGINE_TOP_K,
    split_result: bool = False,
    chunk_size: int = 500,
    chunk_overlap: int = OVERLAP_SIZE,
) -> List[Dict]:
    ...
    # metaphor returns long texts, which have to be split before retrieval
    if split_result:
        docs = [Document(page_content=x.extract,
                         metadata={"link": x.url, "title": x.title})
                for x in contents]
        text_splitter = RecursiveCharacterTextSplitter(["\n\n", "\n", ".", " "],
                                                       chunk_size=chunk_size,
                                                       chunk_overlap=chunk_overlap)
        splitted_docs = text_splitter.split_documents(docs)
        # put the split documents into a temporary store and re-select the TOP_K documents
        if len(splitted_docs) > result_len:
            normal = NormalizedLevenshtein()
            for x in splitted_docs:
                x.metadata["score"] = normal.similarity(text, x.page_content)
            splitted_docs.sort(key=lambda x: x.metadata["score"], reverse=True)
            splitted_docs = splitted_docs[:result_len]
        docs = [{"snippet": x.page_content,
                 "link": x.metadata["link"],
                 "title": x.metadata["title"]}
                for x in splitted_docs]
    ...
```

This code comes from server/chat/search_engine_chat.py. So, to answer your question: when Langchain-Chatchat handles long texts here, it does split them into chunks and mix the chunks together for similarity matching, without considering which chunks belong to the same document.
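For illustration, here is a minimal, self-contained sketch of the same chunk-and-rerank idea. It uses the same RecursiveCharacterTextSplitter and NormalizedLevenshtein classes that the snippet above relies on, but the input results and the query are made-up stand-ins for what the Metaphor API would return:

```python
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

# Made-up search results standing in for Metaphor's long-text extracts.
fake_results = [
    {"url": "https://example.com/a", "title": "Doc A",
     "extract": "Chunking splits a long page into overlapping pieces. " * 40},
    {"url": "https://example.com/b", "title": "Doc B",
     "extract": "Similarity matching scores each piece against the query. " * 40},
]
query = "how are chunks scored against the query"

docs = [Document(page_content=r["extract"],
                 metadata={"link": r["url"], "title": r["title"]})
        for r in fake_results]

# Step 1: split every result into overlapping chunks; chunks from all
# documents land in one flat list, with no grouping by source document.
splitter = RecursiveCharacterTextSplitter(["\n\n", "\n", ".", " "],
                                          chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Step 2: score each chunk against the query with normalized Levenshtein
# similarity and keep only the top 3, regardless of which page it came from.
scorer = NormalizedLevenshtein()
for chunk in chunks:
    chunk.metadata["score"] = scorer.similarity(query, chunk.page_content)
chunks.sort(key=lambda c: c.metadata["score"], reverse=True)

for chunk in chunks[:3]:
    print(f'{chunk.metadata["title"]} ({chunk.metadata["score"]:.3f}): '
          f'{chunk.page_content[:60]}...')
```

Note that even though the chunks are matched as one mixed pool, each chunk's metadata still carries the link and title of the page it came from, so a ranked result can always be traced back to its source document.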