Conversation

Mermaid97
Contributor

✨ Knowledge Base Summary Development
An intelligent knowledge-base summarization method based on document-level clustering and hierarchical summarization.
It splits the output into points correctly and partitions the content properly; currently it produces a content summary, and some analysis could also be added on top.

  1. Document-level clustering first: documents are clustered semantically up front, so content from the same document is never scattered
  2. Smart chunk clustering: based on token estimation, dynamically decides whether chunks need a second level of clustering
  3. Hierarchical summarization: knowledge cards → document-cluster summaries → global summary, refined layer by layer
  4. Async concurrent processing: uses asyncio to call the LLM concurrently, significantly speeding up processing
  5. Structured output: automatically generates point-by-point summaries, 80-120 characters per point, clear and concise
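The concurrency in the async-processing point can be sketched roughly as follows; `summarize_cluster` is a hypothetical stand-in for the real `AsyncLLMClient` call, not code from the PR:

```python
import asyncio

async def summarize_cluster(cluster_id: int, text: str) -> str:
    # Hypothetical stand-in for one AsyncLLMClient summarization round-trip.
    await asyncio.sleep(0)
    return f"summary-{cluster_id}: {text[:20]}"

async def summarize_all(clusters: dict) -> dict:
    # Fan out one task per cluster; asyncio runs the LLM calls concurrently
    # instead of awaiting them one by one.
    tasks = {cid: asyncio.create_task(summarize_cluster(cid, txt))
             for cid, txt in clusters.items()}
    return {cid: await task for cid, task in tasks.items()}

results = asyncio.run(summarize_all({0: "alpha text", 1: "beta text"}))
```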

Main functions:
│ └── async_knowledge_summary_utils.py # core utility classes (1000+ lines)
│ ├── AsyncLLMClient # async LLM client
│ ├── DocumentClusterer # document clusterer
│ ├── ChunkClusterer # chunk clusterer
│ ├── KnowledgeIntegrator # knowledge integrator
│ └── async_vectorize_batch() # async vectorization

├── services/
│ └── elasticsearch_service.py # RAG pipeline orchestration
│ ├── summary_index_name() # main pipeline entry point
│ ├── _reconstruct_documents() # document reconstruction
│ └── _organize_chunks_by_clusters() # chunk reorganization


codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 13.78238% with 832 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| backend/utils/async_knowledge_summary_utils.py | 11.75% | 428 Missing ⚠️ |
| sdk/nexent/vector_database/elasticsearch_core.py | 20.41% | 269 Missing ⚠️ |
| backend/services/elasticsearch_service.py | 4.28% | 134 Missing ⚠️ |
| backend/utils/prompt_template_utils.py | 50.00% | 1 Missing ⚠️ |


@@ -0,0 +1,116 @@
# Async Knowledge Summary Prompt Templates (Chinese)
# 异步知识库总结提示词模板(中文版)
Collaborator @SimengBian commented Oct 13, 2025

Suggest keeping comments in English only; there is no need to provide Chinese comments as well.

@@ -0,0 +1,116 @@
# Async Knowledge Summary Prompt Templates (Chinese)
Collaborator

Could the prompts here avoid mentioning "async"?

@@ -0,0 +1,115 @@
# Async Knowledge Summary Prompt Templates (English)
# 异步知识库总结提示词模板(英文版)
Collaborator

Same as above, English comments alone are enough. Also, some comments are unnecessary (e.g. the line below) and can be deleted.


# Summary Generation Prompt
SUMMARY_GENERATION_PROMPT: |-
### You are a【Knowledge Summary Expert】responsible for generating concise and accurate knowledge summaries.
Collaborator

In English prompts, avoid full-width Chinese brackets like 【; they are unfriendly and can be replaced with [.

raise Exception("Failed to get embedding model")

# Async summary generation stream
async def generate_summary_stream():
Collaborator

This function is too complex; consider splitting it up.
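One possible decomposition along those lines (the helper names and logic here are illustrative, not from the PR): each stage becomes a small async helper and the stream function only orchestrates them.

```python
import asyncio

async def _cluster_documents(chunks):
    # Stage 1: group chunks by source document (placeholder logic).
    groups = {}
    for c in chunks:
        groups.setdefault(c["doc"], []).append(c["text"])
    return groups

async def _summarize_cluster(doc, texts):
    # Stage 2: one summary per document cluster (stand-in for the LLM call).
    return f"{doc}: {len(texts)} chunk(s)"

async def generate_summary_stream(chunks):
    # Stage 3: stream one summary line per cluster.
    clusters = await _cluster_documents(chunks)
    for doc, texts in clusters.items():
        yield await _summarize_cluster(doc, texts)

async def collect():
    return [line async for line in generate_summary_stream(
        [{"doc": "a.md", "text": "x"}, {"doc": "a.md", "text": "y"}])]

lines = asyncio.run(collect())
```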

documents = {}

for chunk in chunks:
doc_id = chunk.get('filename', chunk.get('source_doc', 'unknown'))
Collaborator

Suggest the clearer `or` form: doc_id = chunk.get('filename') or chunk.get('source_doc') or 'unknown'
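The two forms also differ in behavior when a key is present but holds a falsy value; a minimal comparison:

```python
chunk = {"filename": None, "source_doc": "report.pdf"}

# Nested-default form: the fallback only fires when the key is absent,
# so a present-but-None value leaks through.
nested = chunk.get('filename', chunk.get('source_doc', 'unknown'))

# `or` chain: also skips falsy values (None, ''), which is usually the intent.
chained = chunk.get('filename') or chunk.get('source_doc') or 'unknown'
```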


# Group chunks by document cluster
for chunk in chunks:
doc_id = chunk.get('filename', chunk.get('source_doc', 'unknown'))
Collaborator

Same as above; the `or` operator improves readability.

doc_vectors = await async_vectorize_batch(doc_texts, embedding_model, batch_size=20)

# Document-level clustering
from utils.async_knowledge_summary_utils import DocumentClusterer
Collaborator @SimengBian commented Oct 13, 2025

Avoid imports in the middle of the code; please tidy these up, including the numpy import later on.

chunk_clusterer = ChunkClusterer(similarity_threshold=0.70, min_cluster_size=1)
chunk_cluster_result = chunk_clusterer.cluster_chunks_with_document_clusters(chunk_vectors, chunks, chunks_by_doc_cluster)

n_clusters = chunk_cluster_result['n_clusters']
Collaborator

Should the case where chunk_cluster_result is None be considered? e.g. n_clusters = chunk_cluster_result.get('n_clusters', 0)
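Note that `.get` alone does not guard a None result (it would raise AttributeError); a small defensive sketch covering both cases:

```python
def n_clusters_of(chunk_cluster_result):
    # `(result or {})` guards both a None result and a missing key;
    # plain result.get(...) would still raise AttributeError on None.
    return (chunk_cluster_result or {}).get('n_clusters', 0)
```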

# Escape quotes
yield f"data: {{\"status\": \"success\", \"message\": \"\\\"\"}}\n\n"
else:
yield f"data: {{\"status\": \"success\", \"message\": \"{char}\"}}\n\n"
Contributor

This handling is too crude; json.dumps can deal with this JSON escaping cleanly.

except Exception as e:
logger.error(f"Error in async summary generation: {e}", exc_info=True)
error_msg = str(e).replace('"', '\\"').replace('\n', '\\n')
yield f"data: {{\"status\": \"error\", \"message\": \"{error_msg}\"}}\n\n"
Contributor

These streamed messages should be uniformly reworked to use json.dumps.
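A minimal helper along those lines; json.dumps handles quotes, backslashes, and newlines in one place, replacing the hand-rolled `.replace()` chains:

```python
import json

def sse_message(status: str, message: str) -> str:
    # json.dumps escapes quotes, backslashes and newlines correctly,
    # so both success and error paths can share this one helper.
    payload = json.dumps({"status": status, "message": message},
                         ensure_ascii=False)
    return f"data: {payload}\n\n"
```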


response = await self.chat_async(
messages=messages,
max_tokens=max_length, # 减少token数量,强制简洁
Contributor

Please translate these comments to English.

"""Calculate confidence score"""
size_score = min(chunk_cluster['size'] / 10.0, 1.0)
similarity_score = chunk_cluster.get('avg_similarity', 0.0)
confidence = 0.4 * size_score + 0.6 * similarity_score
Contributor

Functions doing raw numeric calculations like this should carry comments explaining why this formula was chosen, to improve readability.
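A sketch of the same function with the rationale spelled out in comments; the rationale itself is an assumption read off the weights, not something stated in the PR:

```python
def calculate_confidence(chunk_cluster: dict) -> float:
    """Confidence in a cluster summary, in [0, 1].

    Assumed rationale: clusters backed by more chunks are more trustworthy,
    saturating at 10 chunks, and internal coherence (avg_similarity) matters
    somewhat more than raw size, hence the 0.4 / 0.6 weighting.
    """
    size_score = min(chunk_cluster['size'] / 10.0, 1.0)   # saturates at 10 chunks
    similarity_score = chunk_cluster.get('avg_similarity', 0.0)
    return 0.4 * size_score + 0.6 * similarity_score
```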


def _fallback_keyword_extraction(self, text: str) -> List[str]:
"""Fallback keyword extraction"""
import re
Contributor

Note that import statements belong at the top of the file.

import re

words = re.findall(r'[\u4e00-\u9fa5]+', text)
stop_words = {'的', '了', '和', '是', '在', '有', '个', '等', '与', '及'}
Contributor

Why are stop_words handled specially for Chinese only here?

logger.warning("LLM global integration failed, using fallback strategy")
return "\n\n".join(cluster_summaries)

def _clean_markdown_symbols(self, text: str) -> str:
Contributor

A utility function like this already appeared earlier; suggest extracting it for reuse.
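A sketch of the extraction into a shared helper; the actual cleaning logic is not shown in this excerpt, so the module name and the pattern below are placeholders:

```python
# Hypothetical shared module (e.g. a text utils file) so both call sites
# reuse one implementation instead of duplicating it.
import re

_MARKDOWN_SYMBOLS = re.compile(r'[*_`#>]+')

def clean_markdown_symbols(text: str) -> str:
    """Strip common Markdown markers; stand-in for the duplicated helper."""
    return _MARKDOWN_SYMBOLS.sub('', text)
```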

import re

words = re.findall(r'[\u4e00-\u9fa5]+', text)
stop_words = {'的', '了', '和', '是', '在', '有', '个', '等', '与', '及'}
Contributor

Suggest extracting this as a shared constant.
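For example, hoisted to module level (the constant name is illustrative; the set is copied from the snippet above):

```python
# Module-level constant so every caller shares one definition.
CHINESE_STOP_WORDS = frozenset({'的', '了', '和', '是', '在', '有', '个', '等', '与', '及'})

def filter_stop_words(words):
    # Drop stop words while preserving the original order.
    return [w for w in words if w not in CHINESE_STOP_WORDS]
```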

formatted_text = '\n\n'.join(formatted_lines)

# Ensure each point has proper spacing and clear structure
import re
Contributor

Quite a few imports are misplaced; please audit and fix them.

import re
# Clean up any extra spaces and ensure consistent formatting
formatted_text = re.sub(r'\n\s*\n\s*\n', '\n\n', formatted_text) # Remove excessive line breaks
formatted_text = re.sub(r'([一二三四五六七八九十]+、[^一-十\n]+)(?=\n|$)', r'\1', formatted_text) # Ensure proper ending
Contributor

Suggest extracting these as constants so they can be changed uniformly later.
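For instance, as precompiled module-level patterns (the constant names are illustrative; the patterns are copied from the snippet above):

```python
import re

# Compiled once, named once, and a single place to change them later.
EXCESS_BLANK_LINES = re.compile(r'\n\s*\n\s*\n')
NUMBERED_POINT = re.compile(r'([一二三四五六七八九十]+、[^一-十\n]+)(?=\n|$)')

def normalize_blank_lines(text: str) -> str:
    # Collapse runs of three or more newlines down to a single blank line.
    return EXCESS_BLANK_LINES.sub('\n\n', text)
```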

logger.error(f"Error getting stats for index {index_name}: {str(e)}")
all_stats[index_name] = {"error": str(e)}

return all_stats
Contributor

This file does not appear to have real changes; a tab/whitespace edit may have made git detect a modification. Suggest reverting it.

self.assertIsInstance(result, StreamingResponse)
mock_get_docs.assert_called_once()
mock_calculate_weights.assert_called_once()
mock_get_model_by_model_id.assert_called_once_with(1, "test_tenant")
Contributor

New code necessarily needs new unit tests; please add them.

@Phinease
Member

Unit test coverage is insufficient; please top it up.

@WMC001 WMC001 changed the title ✨ Knowledge Base Summary Development [Next week]✨ Knowledge Base Summary Development Oct 15, 2025