[Next week]✨ Knowledge Base Summary Development #1364
base: develop
# Conflicts: # backend/pyproject.toml
@@ -0,0 +1,116 @@
# Async Knowledge Summary Prompt Templates (Chinese)
# 异步知识库总结提示词模板(中文版)
Keep the comments in English only; there is no need to provide Chinese comments.
@@ -0,0 +1,116 @@
# Async Knowledge Summary Prompt Templates (Chinese)
Could the prompt file avoid mentioning "async" here?
@@ -0,0 +1,115 @@
# Async Knowledge Summary Prompt Templates (English)
# 异步知识库总结提示词模板(英文版)
Same as above: English comments are enough. Also, some comments are unnecessary (for example, the comment on the line below) and can be deleted.
# Summary Generation Prompt
SUMMARY_GENERATION_PROMPT: |-
### You are a【Knowledge Summary Expert】responsible for generating concise and accurate knowledge summaries.
In English prompts, avoid Chinese punctuation such as 【; it is not reader-friendly. Replace it with [.
raise Exception("Failed to get embedding model")
# Async summary generation stream
async def generate_summary_stream():
This function is too complex; consider splitting it up.
documents = {}
for chunk in chunks:
doc_id = chunk.get('filename', chunk.get('source_doc', 'unknown'))
Consider the or form, which is clearer: doc_id = chunk.get('filename') or chunk.get('source_doc') or 'unknown'
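The two forms are not strictly equivalent, which is worth knowing before swapping one for the other. The nested `get` only falls back when the key is absent, while the `or` chain also falls back when the key is present but holds `None` or an empty string. A minimal sketch, where `resolve_doc_id` and the sample chunk are hypothetical and not part of the pull request:

```python
def resolve_doc_id(chunk: dict) -> str:
    """Return the first non-empty identifier, falling back to 'unknown'."""
    return chunk.get("filename") or chunk.get("source_doc") or "unknown"


# Hypothetical chunk whose 'filename' key exists but holds None.
chunk = {"filename": None, "source_doc": "report.pdf"}

# Nested form: the key exists, so the inner default is never used.
nested = chunk.get("filename", chunk.get("source_doc", "unknown"))

# Chained form: None is falsy, so the next candidate is tried.
chained = resolve_doc_id(chunk)
```

If an empty filename should be kept as a valid identifier, the `or` chain would change behavior, so the choice depends on what the chunk data can actually contain.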
# Group chunks by document cluster
for chunk in chunks:
doc_id = chunk.get('filename', chunk.get('source_doc', 'unknown'))
Same as above: consider the or operator for readability.
doc_vectors = await async_vectorize_batch(doc_texts, embedding_model, batch_size=20)
# Document-level clustering
from utils.async_knowledge_summary_utils import DocumentClusterer
Avoid importing in the middle of the code; please tidy these up, including the numpy import further down.
chunk_clusterer = ChunkClusterer(similarity_threshold=0.70, min_cluster_size=1)
chunk_cluster_result = chunk_clusterer.cluster_chunks_with_document_clusters(chunk_vectors, chunks, chunks_by_doc_cluster)
n_clusters = chunk_cluster_result['n_clusters']
Should the case where chunk_cluster_result is None be handled? n_clusters = chunk_cluster_result.get('n_clusters', 0)
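Note that `.get` with a default covers a missing key but not a `None` result, since calling `.get` on `None` raises `AttributeError`. A minimal sketch that covers both cases, assuming the clusterer may return `None` (the source does not confirm this) and using a hypothetical helper name:

```python
from typing import Optional


def get_cluster_count(chunk_cluster_result: Optional[dict]) -> int:
    """Return the number of clusters, treating None or a missing key as 0."""
    # `or {}` substitutes an empty dict when the result is None.
    return (chunk_cluster_result or {}).get("n_clusters", 0)
```

Whether a silent fallback to 0 is the right behavior, as opposed to logging and aborting the summary, is a design decision for the author.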
# Escape quotes
yield f"data: {{\"status\": \"success\", \"message\": \"\\\"\"}}\n\n"
else:
yield f"data: {{\"status\": \"success\", \"message\": \"{char}\"}}\n\n"
This handling is too crude; json.dumps should handle this kind of JSON formatting problem well.
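One way to apply this suggestion is a single formatting helper that every `yield` goes through. The sketch below is an assumption about how it could look, not the author's code; `sse_event` is a hypothetical name. `json.dumps` escapes quotes, backslashes, and newlines, so the payload stays on one line as the event-stream format requires.

```python
import json


def sse_event(status: str, message: str) -> str:
    """Format one server-sent event whose data field is a JSON object."""
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
    payload = json.dumps({"status": status, "message": message}, ensure_ascii=False)
    return f"data: {payload}\n\n"


# Quotes and newlines in the message need no manual escaping.
event = sse_event("success", 'say "hi"\n')
```

The same helper would also serve the error branch, for example `yield sse_event("error", str(e))`.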
except Exception as e:
logger.error(f"Error in async summary generation: {e}", exc_info=True)
error_msg = str(e).replace('"', '\\"').replace('\n', '\\n')
yield f"data: {{\"status\": \"error\", \"message\": \"{error_msg}\"}}\n\n"
Streaming messages like this should all be switched to json.dumps.
response = await self.chat_async(
messages=messages,
max_tokens=max_length, # 减少token数量,强制简洁
Translate the comment into English.
"""Calculate confidence score"""
size_score = min(chunk_cluster['size'] / 10.0, 1.0)
similarity_score = chunk_cluster.get('avg_similarity', 0.0)
confidence = 0.4 * size_score + 0.6 * similarity_score
A function that computes directly with numeric literals like this should have comments explaining why this formula is used, to improve readability.
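A possible shape for this, as a sketch only: the constant names are hypothetical, the values are the ones in the diff, and the comments describe what each value does rather than why it was chosen, since the pull request does not state the rationale.

```python
# Hypothetical constant names; the values come from the diff above.
SIZE_SATURATION = 10.0    # clusters with 10 or more chunks get the full size score
SIZE_WEIGHT = 0.4         # share of the confidence contributed by cluster size
SIMILARITY_WEIGHT = 0.6   # share contributed by average intra-cluster similarity


def calculate_confidence(chunk_cluster: dict) -> float:
    """Weighted sum of a capped size score and the average similarity."""
    size_score = min(chunk_cluster["size"] / SIZE_SATURATION, 1.0)
    similarity_score = chunk_cluster.get("avg_similarity", 0.0)
    return SIZE_WEIGHT * size_score + SIMILARITY_WEIGHT * similarity_score
```

Because the two weights sum to 1 and both scores are at most 1, the result stays within [0, 1] as long as `avg_similarity` does.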
def _fallback_keyword_extraction(self, text: str) -> List[str]:
"""Fallback keyword extraction"""
import re
Keep import statements at the top of the file.
import re
words = re.findall(r'[\u4e00-\u9fa5]+', text)
stop_words = {'的', '了', '和', '是', '在', '有', '个', '等', '与', '及'}
Why are stop_words handled separately for Chinese here?
logger.warning("LLM global integration failed, using fallback strategy")
return "\n\n".join(cluster_summaries)
def _clean_markdown_symbols(self, text: str) -> str:
I have already seen this utility function elsewhere; consider extracting it for reuse.
import re
words = re.findall(r'[\u4e00-\u9fa5]+', text)
stop_words = {'的', '了', '和', '是', '在', '有', '个', '等', '与', '及'}
Consider extracting this into a shared constant.
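A sketch of what the extraction could look like, with the import moved to the top of the file as requested elsewhere in this review. The constant and function names are hypothetical, and the filtering step is an assumption, because the diff excerpt does not show how the original function uses `stop_words`.

```python
import re
from typing import List

# Module-level constants shared by every caller; names are hypothetical.
CHINESE_STOP_WORDS = frozenset({'的', '了', '和', '是', '在', '有', '个', '等', '与', '及'})
CHINESE_WORD_PATTERN = re.compile(r'[\u4e00-\u9fa5]+')


def extract_chinese_keywords(text: str) -> List[str]:
    """Return runs of Chinese characters that are not stop words."""
    # Assumed behavior: drop any run that is exactly a stop word.
    return [
        word
        for word in CHINESE_WORD_PATTERN.findall(text)
        if word not in CHINESE_STOP_WORDS
    ]
```

Compiling the pattern once at import time also avoids repeating the regex literal in both places where it currently appears.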
formatted_text = '\n\n'.join(formatted_lines)
# Ensure each point has proper spacing and clear structure
import re
Many imports are misplaced; please check and fix them.
import re
# Clean up any extra spaces and ensure consistent formatting
formatted_text = re.sub(r'\n\s*\n\s*\n', '\n\n', formatted_text) # Remove excessive line breaks
formatted_text = re.sub(r'([一二三四五六七八九十]+、[^一-十\n]+)(?=\n|$)', r'\1', formatted_text) # Ensure proper ending
Consider extracting these into constants so they can be changed in one place later.
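As an illustration of this suggestion, the first pattern could be compiled once as a named module-level constant. The names below are hypothetical. The second substitution in the diff is left out of the sketch: it replaces each match with its own first group, which spans the whole match, so it appears to leave the text unchanged and may be worth a second look.

```python
import re

# Hypothetical constant name; the pattern is the one from the diff above.
EXCESS_BLANK_LINES = re.compile(r'\n\s*\n\s*\n')


def collapse_blank_lines(text: str) -> str:
    """Collapse three or more line breaks (with optional whitespace between) into one blank line."""
    return EXCESS_BLANK_LINES.sub('\n\n', text)
```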
logger.error(f"Error getting stats for index {index_name}: {str(e)}")
all_stats[index_name] = {"error": str(e)}
return all_stats
This file appears to have no real changes. Did changing tabs cause git to detect a modification? Consider reverting it.
self.assertIsInstance(result, StreamingResponse)
mock_get_docs.assert_called_once()
mock_calculate_weights.assert_called_once()
mock_get_model_by_model_id.assert_called_once_with(1, "test_tenant")
New code must come with new unit tests; please add them.
✨ Knowledge Base Summary Development

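An intelligent knowledge base summarization method based on document-level clustering and hierarchical summarization.
It splits the summary into points and partitions the content correctly. It currently summarizes content only; some analysis could also be added.
2. Smart chunk clustering: based on a token estimate, dynamically decides whether chunks need second-level clustering
Main functions:
│ └── async_knowledge_summary_utils.py # Core utility classes (1000+ lines)
│ ├── AsyncLLMClient # Async LLM client
│ ├── DocumentClusterer # Document clusterer
│ ├── ChunkClusterer # Chunk clusterer
│ ├── KnowledgeIntegrator # Knowledge integrator
│ └── async_vectorize_batch() # Async vectorization
│
├── services/
│ └── elasticsearch_service.py # RAG pipeline orchestration
│ ├── summary_index_name() # Main pipeline entry point
│ ├── _reconstruct_documents() # Document reconstruction
│ └── _organize_chunks_by_clusters() # Chunk reorganization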
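
The token-based decision described in point 2 could be sketched roughly as follows. Everything here is an assumption for illustration: the characters-per-token ratio, the budget, the `content` key, and the function names are not taken from the pull request, which does not show how the estimate is computed.

```python
from typing import List

# Illustrative values only; neither comes from the pull request.
CHARS_PER_TOKEN = 2.0
TOKEN_BUDGET = 4000


def estimate_tokens(text: str) -> int:
    """Rough token estimate derived from character count."""
    return int(len(text) / CHARS_PER_TOKEN) + 1


def needs_chunk_clustering(chunks: List[dict], budget: int = TOKEN_BUDGET) -> bool:
    """Return True when the chunks together exceed the token budget."""
    # Assumes each chunk stores its text under a 'content' key.
    total = sum(estimate_tokens(chunk.get("content", "")) for chunk in chunks)
    return total > budget
```

Under this sketch, a document cluster whose chunks fit within the budget would be summarized directly, and only larger clusters would go through second-level chunk clustering.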