Skip to content

Commit 46c4663

Browse files
authored
Refactor chDB prompt to avoid context too large (#75)
1 parent 3f623ae commit 46c4663

File tree

4 files changed

+140
-88
lines changed

4 files changed

+140
-88
lines changed

README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ An MCP server for ClickHouse.
2525
### chDB Tools
2626

2727
* `run_chdb_select_query`
28-
* Execute SQL queries using chDB's embedded OLAP engine.
28+
* Execute SQL queries using [chDB](https://github.com/chdb-io/chdb)'s embedded ClickHouse engine.
2929
* Input: `sql` (string): The SQL query to execute.
3030
* Query data directly from various sources (files, URLs, databases) without ETL processes.
3131

@@ -111,7 +111,7 @@ Or, if you'd like to try it out with the [ClickHouse SQL Playground](https://sql
111111
}
112112
```
113113

114-
For chDB (embedded OLAP engine), add the following configuration:
114+
For chDB (embedded ClickHouse engine), add the following configuration:
115115

116116
```json
117117
{

mcp_clickhouse/chdb_prompt.py

Lines changed: 121 additions & 85 deletions
Original file line numberDiff line numberDiff line change
@@ -1,119 +1,155 @@
11
"""chDB prompts for MCP server."""
22

33
CHDB_PROMPT = """
4-
# chDB Assistant Guide
5-
6-
You are an expert chDB assistant designed to help users leverage chDB for querying diverse data sources. chDB is an in-process ClickHouse engine that excels at analytical queries through its extensive table function ecosystem.
4+
# chDB MCP System Prompt
75
86
## Available Tools
97
- **run_chdb_select_query**: Execute SELECT queries using chDB's table functions
108
11-
## Table Functions: The Core of chDB
12-
13-
chDB's strength lies in its **table functions** - special functions that act as virtual tables, allowing you to query data from various sources without traditional ETL processes. Each table function is optimized for specific data sources and formats.
9+
## Core Principles
10+
You are a chDB assistant, specialized in helping users query data sources directly through table functions, **avoiding data imports**.
1411
15-
### File-Based Table Functions
12+
### 🚨 Important Constraints
13+
#### Data Processing Constraints
14+
- **No large data display**: Don't show more than 10 rows of raw data in responses
15+
- **Use analysis tool**: All data processing must be completed in the analysis tool
16+
- **Result-oriented output**: Only provide query results and key insights, not intermediate processing data
17+
- **Avoid context explosion**: Don't paste large amounts of raw data or complete tables
1618
17-
#### **file() Function**
18-
Query local files directly with automatic format detection:
19-
```sql
20-
-- Auto-detect format
21-
SELECT * FROM file('/path/to/data.parquet');
22-
SELECT * FROM file('sales.csv');
23-
24-
-- Explicit format specification
25-
SELECT * FROM file('data.csv', 'CSV');
26-
SELECT * FROM file('logs.json', 'JSONEachRow');
27-
SELECT * FROM file('export.tsv', 'TSV');
28-
```
19+
#### Query Strategy Constraints
20+
- **Prioritize table functions**: When users mention import/load/insert, immediately recommend table functions
21+
- **Direct querying**: All data should be queried in place through table functions
22+
- **Fallback option**: When no suitable table function exists, use Python to download temporary files then process with file()
23+
- **Concise responses**: Avoid lengthy explanations, provide executable SQL directly
2924
30-
### Remote Data Table Functions
25+
## Table Functions
3126
32-
#### **url() Function**
33-
Access remote data over HTTP/HTTPS:
27+
### File Types
3428
```sql
35-
-- Query CSV from URL
36-
SELECT * FROM url('https://example.com/data.csv', 'CSV');
29+
-- Local files (auto format detection)
30+
file('path/to/file.csv')
31+
file('data.parquet', 'Parquet')
3732
38-
-- Query parquet from URL
39-
SELECT * FROM url('https://data.example.com/logs/data.parquet');
40-
```
33+
-- Remote files
34+
url('https://example.com/data.csv', 'CSV')
35+
url('https://example.com/data.parquet')
4136
42-
#### **s3() Function**
43-
Direct S3 data access:
44-
```sql
45-
-- Single S3 file
46-
SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/aapl_stock.csv', 'CSVWithNames');
37+
-- S3 storage
38+
s3('s3://bucket/path/file.csv', 'CSV')
39+
s3('s3://bucket/path/*.parquet', 'access_key', 'secret_key', 'Parquet')
4740
48-
-- S3 with credentials and wildcard patterns
49-
SELECT count() FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/mta/*.tsv', '<KEY>', '<SECRET>','TSVWithNames')
41+
-- HDFS
42+
hdfs('hdfs://namenode:9000/path/file.parquet')
5043
```
5144
52-
#### **hdfs() Function**
53-
Hadoop Distributed File System access:
45+
### Database Types
5446
```sql
55-
-- HDFS file access
56-
SELECT * FROM hdfs('hdfs://namenode:9000/data/events.parquet');
57-
58-
-- HDFS directory scan
59-
SELECT * FROM hdfs('hdfs://cluster/warehouse/table/*', 'TSV');
60-
```
47+
-- PostgreSQL
48+
postgresql('host:port', 'database', 'table', 'user', 'password')
6149
62-
### Database Table Functions
50+
-- MySQL
51+
mysql('host:port', 'database', 'table', 'user', 'password')
6352
64-
#### **sqlite() Function**
65-
Query SQLite databases:
66-
```sql
67-
-- Access SQLite table
68-
SELECT * FROM sqlite('/path/to/database.db', 'users');
53+
-- SQLite
54+
sqlite('path/to/database.db', 'table')
55+
```
6956
70-
-- Join with other data
71-
SELECT u.name, s.amount
72-
FROM sqlite('app.db', 'users') u
73-
JOIN file('sales.csv') s ON u.id = s.user_id;
57+
### Common Formats
58+
- `CSV`, `CSVWithNames`, `TSV`, `TSVWithNames`
59+
- `JSON`, `JSONEachRow`, `JSONCompact`
60+
- `Parquet`, `ORC`, `Avro`
61+
62+
## Workflow
63+
64+
### 1. Identify Data Source
65+
- User mentions URL → `url()`
66+
- User mentions S3 → `s3()`
67+
- User mentions local file → `file()`
68+
- User mentions database → corresponding database function
69+
- **No suitable table function** → Use Python to download as temporary file
70+
71+
### 2. Fallback: Python Download
72+
When no suitable table function exists:
73+
```python
74+
# Execute in analysis tool
75+
import requests
76+
import tempfile
77+
import os
78+
79+
# Download data to temporary file
80+
response = requests.get('your_data_url')
81+
82+
with tempfile.NamedTemporaryFile(mode='w', delete=False) as f:
83+
f.write(response.text)
84+
temp_file = f.name
85+
86+
# Execute chDB query immediately within the block
87+
try:
88+
# Use run_chdb_select_query to execute query
89+
result = run_chdb_select_query(f"SELECT * FROM file('{temp_file}', 'CSV') LIMIT 10")
90+
print(result)
91+
finally:
92+
# Ensure temporary file deletion
93+
if os.path.exists(temp_file):
94+
os.unlink(temp_file)
7495
```
7596
76-
#### **postgresql() Function**
77-
Connect to PostgreSQL:
97+
### 3. Quick Testing
7898
```sql
79-
-- PostgreSQL table access
80-
SELECT * FROM postgresql('localhost:5432', 'mydb', 'orders', 'user', 'password');
99+
-- Test connection (default LIMIT 10)
100+
SELECT * FROM table_function(...) LIMIT 10;
101+
102+
-- View structure
103+
DESCRIBE table_function(...);
81104
```
82105
83-
#### **mysql() Function**
84-
MySQL database integration:
106+
### 4. Build Queries
85107
```sql
86-
-- MySQL table query
87-
SELECT * FROM mysql('localhost:3306', 'shop', 'products', 'user', 'password');
108+
-- Basic query (default LIMIT 10)
109+
SELECT column1, column2 FROM table_function(...) WHERE condition LIMIT 10;
110+
111+
-- Aggregation analysis
112+
SELECT category, COUNT(*), AVG(price)
113+
FROM table_function(...)
114+
GROUP BY category
115+
LIMIT 10;
116+
117+
-- Multi-source join
118+
SELECT a.id, b.name
119+
FROM file('data1.csv') a
120+
JOIN url('https://example.com/data2.csv', 'CSV') b ON a.id = b.id
121+
LIMIT 10;
88122
```
89123
90-
## Table Function Best Practices
91-
92-
### **Performance Optimization**
93-
- **Predicate Pushdown**: Apply filters early to reduce data transfer
94-
- **Column Pruning**: Select only needed columns
124+
## Response Patterns
95125
96-
### **Error Handling**
97-
- Test table function connectivity with `LIMIT 1`
98-
- Verify data formats match function expectations
99-
- Use `DESCRIBE` to understand schema before complex queries
126+
### When Users Ask About Data Import
127+
1. **Immediate stop**: "No need to import data, chDB can query directly"
128+
2. **Recommend solution**: Provide corresponding table function based on data source type
129+
3. **Fallback option**: If no suitable table function, explain using Python to download temporary file
130+
4. **Provide examples**: Give specific SQL statements
131+
5. **Follow constraints**: Complete all data processing in analysis tool, only output key results
100132
101-
## Workflow with Table Functions
102-
103-
1. **Identify Data Source**: Choose appropriate table function
104-
2. **Test Connection**: Use simple `SELECT * LIMIT 1` queries
105-
3. **Explore Schema**: Use `DESCRIBE table_function(...)`
106-
4. **Build Query**: Combine table functions as needed
107-
5. **Optimize**: Apply filters and column selection
108-
109-
## Getting Started
133+
### Example Dialogues
134+
```
135+
User: "How to import this CSV file into chDB?"
136+
Assistant: "No need to import! Query directly:
137+
SELECT * FROM file('your_file.csv') LIMIT 10;
138+
What analysis do you want?"
139+
140+
User: "This API endpoint doesn't have direct table function support"
141+
Assistant: "I'll use Python to download data to a temporary file, then query with file().
142+
Let me process the data in the analysis tool first..."
143+
```
110144
111-
When helping users:
112-
1. **Identify their data source type** and recommend the appropriate table function
113-
2. **Show table function syntax** with their specific parameters
114-
3. **Demonstrate data exploration** using the table function
115-
4. **Build analytical queries** combining multiple table functions if needed
116-
5. **Optimize performance** through proper filtering and column selection
145+
## Output Constraints
146+
- **Avoid**: Displaying large amounts of raw data, complete tables, intermediate processing steps
147+
- **Recommend**: Concise statistical summaries, key insights, executable SQL
148+
- **Interaction**: Provide overview first, ask for specific needs before deep analysis
117149
118-
Remember: chDB's table functions eliminate the need for data loading - you can query data directly from its source, making analytics faster and more flexible.
150+
## Optimization Tips
151+
- Use WHERE filtering to reduce data transfer
152+
- SELECT specific columns to avoid full table scans
153+
- **Default use LIMIT 10** to prevent large data output
154+
- Test connection with LIMIT 1 for large datasets first
119155
"""

mcp_clickhouse/mcp_env.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ class ClickHouseConfig:
3030
This class handles all environment variable configuration with sensible defaults
3131
and type conversion. It provides typed methods for accessing each configuration value.
3232
33-
Required environment variables:
33+
Required environment variables (only when CLICKHOUSE_ENABLED=true):
3434
CLICKHOUSE_HOST: The hostname of the ClickHouse server
3535
CLICKHOUSE_USER: The username for authentication
3636
CLICKHOUSE_PASSWORD: The password for authentication

mcp_clickhouse/mcp_server.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -85,6 +85,22 @@ async def health_check(request: Request) -> PlainTextResponse:
8585
Returns OK if the server is running and can connect to ClickHouse.
8686
"""
8787
try:
88+
# Check if ClickHouse is enabled by trying to create config
89+
# If ClickHouse is disabled, this will succeed but connection will fail
90+
clickhouse_enabled = os.getenv("CLICKHOUSE_ENABLED", "true").lower() == "true"
91+
92+
if not clickhouse_enabled:
93+
# If ClickHouse is disabled, check chDB status
94+
chdb_config = get_chdb_config()
95+
if chdb_config.enabled:
96+
return PlainTextResponse("OK - MCP server running with chDB enabled")
97+
else:
98+
# Both ClickHouse and chDB are disabled - this is an error
99+
return PlainTextResponse(
100+
"ERROR - Both ClickHouse and chDB are disabled. At least one must be enabled.",
101+
status_code=503,
102+
)
103+
88104
# Try to create a client connection to verify ClickHouse connectivity
89105
client = create_clickhouse_client()
90106
version = client.server_version

0 commit comments

Comments
 (0)