DEVAI-346: Add hive query tool #105
base: main
Conversation
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️
Pull Request Overview
This PR adds Hive query execution support to the MCP server, enabling LLMs to rewrite and validate Trino queries against Hive. The implementation provides four new Hive-specific tools that complement the existing Trino functionality.
- Implements a complete Hive client using Treasure Data's v3 REST API
- Adds four Hive tools: query execution, job management, and result retrieval
- Integrates existing security validation and audit logging systems
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
File | Description
---|---
src/client/hive.ts | New TDHiveClient implementing REST API communication with job management and query execution
src/client/tdapi/endpoints.ts | Centralized TD API endpoint configuration for different sites
src/tools/hive/*.ts | Four new Hive tools providing query, execute, job-status, and job-result functionality
src/server.ts | Integration of Hive tools into the MCP server's tool registry and handlers
src/security/audit-logger.ts | Fixed console logging to use stderr instead of stdout for MCP compatibility
tests/**/*.test.ts | Comprehensive test coverage for all new Hive functionality
Pull Request Overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.
PR Review Summary
Thank you for adding Hive query functionality! The implementation is comprehensive with good security practices and test coverage. Below are my findings:
✅ Strengths:
- Comprehensive implementation with query, execute, job status, and result operations
- Good test coverage with proper mocking
- Secure by design with SQL validation and audit logging
- Proper API key masking in errors
- Well-structured code organization under `src/tools/hive/`
🔴 Critical Issues to Fix:
- Unrelated change in audit-logger.ts - Revert the console.log → console.error change
- Missing documentation - Add Hive integration details to CLAUDE.md
🟡 Recommendations:
- Consider refactoring to class-based pattern for consistency with existing Presto tools
- Add error logging in `getJobErrorDetails` for debugging
- Consider adding integration tests (can be a follow-up PR)
I'll add specific inline comments for the issues found.
const rowCount = entry.rowCount !== undefined ? ` -> ${entry.rowCount} rows` : '';

-    console.log(
+    console.error(
🔴 Critical Issue: This change from `console.log` to `console.error` is unrelated to the Hive feature and breaks consistency with the existing audit logging pattern. Please revert this change.
properties: {},
},
},
// Hive Tools
📝 Documentation Needed: This PR should also update README/CLAUDE.md to document the new Hive integration, similar to the existing CDP and Workflow sections. The documentation should include:
- Overview of Hive integration and when to use it vs Presto
- Available tools and their usage
- Authentication details
- Example usage patterns
private async getJobErrorDetails(jobId: string): Promise<string | undefined> {
  try {
    const show = await this.request<JobShowResponse>(
      'GET',
🟡 Suggestion: Consider adding debug logging here when errors occur while fetching job details. This would help with troubleshooting issues in production:
} catch (e) {
  if (process.env.TD_MCP_LOG_TO_CONSOLE === 'true') {
    console.error('[Hive] Failed to get job error details:', e);
  }
  return undefined;
}
inputSchema: {
  type: 'object',
  properties: {
    sql: { type: 'string', description: 'Hive SQL query (SELECT, SHOW, DESCRIBE only).' },
Let's note the use of the Hive SQL dialect and Treasure Data UDFs in the tool description, so that the LLM can generate TD-Hive-specific queries.
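For example, the property description could be extended along these lines (just a sketch; which TD UDFs to call out, e.g. TD_TIME_RANGE or TD_INTERVAL, is up to the author):

sql: {
  type: 'string',
  description:
    'Hive SQL query (SELECT, SHOW, DESCRIBE only). Uses the Hive SQL dialect and supports ' +
    'Treasure Data UDFs such as TD_TIME_RANGE and TD_INTERVAL for time-based filtering.',
},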
  limit: z.number().int().min(1).max(10000).optional().describe('Max rows to return (default 40)')
});

export const hiveQuery = {
🟡 Architecture Consideration: The Hive tools export simple objects with `handler` functions, while the existing Presto `QueryTool` uses a class-based approach. Consider refactoring to match the class pattern for better consistency with the existing codebase architecture.
However, I notice the CDP and Workflow tools also use the simple object pattern, so this may be acceptable. Just worth noting the inconsistency between different tool types.
^ This is a review from Claude Code.
I think using top-level functions is better for TypeScript, so there is no need to use a class here. We should instead revise the existing code to use top-level functions.
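For reference, a minimal sketch of the plain-object, top-level-function style being discussed (the schema fields and handler signature mirror snippets from this diff; the tool name field and handler body are illustrative assumptions):

import { z } from 'zod';

// Input schema mirroring the one shown in this diff
const inputSchema = z.object({
  sql: z.string().describe('Hive SQL query (SELECT, SHOW, DESCRIBE only).'),
  database: z.string().optional(),
  limit: z.number().int().min(1).max(10000).optional().describe('Max rows to return (default 40)'),
});

// Tool defined as a plain object whose handler is an ordinary async function,
// rather than a class wrapping the same logic
export const hiveQuery = {
  name: 'hive_query',
  handler: async (args: unknown) => {
    const { sql, database, limit } = inputSchema.parse(args);
    // ... validate the SQL, submit the Hive job, poll, and return up to `limit` rows
  },
};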
const config = loadConfig();
const client = new TDHiveClient({ ...config, database: database || config.database });
const validator = new QueryValidator(false); // disallow writes regardless of env flag for this tool
✅ Good Security Practice: Excellent implementation here - `hive_query` hardcodes read-only validation regardless of the config flag, ensuring this tool can never perform write operations. This is a solid security pattern.
expect(consoleLogSpy).toHaveBeenCalled();
const logMessage = consoleLogSpy.mock.calls[0][0] as string;
expect(consoleErrorSpy).toHaveBeenCalled();
The change from `consoleLogSpy` to `consoleErrorSpy` should be reverted along with the audit-logger.ts change to maintain consistency.
handler: async (args: unknown) => {
  const { sql, database, limit } = inputSchema.parse(args);

  const config = loadConfig();
🔴 Database Context Pattern Issue: The database parameter handling here doesn't follow the established pattern used by Trino tools.
Current implementation creates a new client with the database baked into the config:
const client = new TDHiveClient({ ...config, database: database || config.database });
This should follow the Trino pattern of using a `useDatabase` context method instead:
const client = new TDHiveClient(config);
if (database) {
  client.useDatabase(database);
}
This inconsistency will confuse LLMs about how to properly set database context across different tools.
},
handler: async (args: unknown): Promise<HiveExecuteWriteResult | HiveQueryResult> => {
  const { sql, database, priority, retry_limit, pool_name, limit } = inputSchema.parse(args);
🔴 Same Database Pattern Issue: This also creates a new client with the database baked into the config rather than using a `useDatabase` pattern. It should be consistent with the Trino tools.
/**
 * Minimal REST client for Treasure Data Hive jobs (v3 API)
 */
🔴 Missing useDatabase Method: The `TDHiveClient` should implement a `useDatabase` method to match the pattern used by `TDTrinoClient`. This would ensure consistent database context handling across all query tools.
Suggested implementation:
export class TDHiveClient {
  private readonly apiKey: string;
  private readonly baseUrl: string;
  private currentDatabase: string; // Add this
  readonly database: string;

  constructor(config: Config) {
    this.apiKey = config.td_api_key;
    this.baseUrl = getTdApiEndpointForSite(config.site);
    this.database = config.database || 'sample_datasets';
    this.currentDatabase = this.database; // Initialize
  }

  // Add this method
  useDatabase(database: string): this {
    this.currentDatabase = database;
    return this;
  }

  async issueHive(query: string, options?: HiveIssueJobOptions): Promise<{ job_id: string }> {
    // Use this.currentDatabase instead of a database parameter
    const db = this.currentDatabase;
    // ... rest of implementation
  }
}
This would allow tools to call `client.useDatabase(database)` instead of passing the database through config, matching the Trino pattern.
@@ -0,0 +1,5 @@
export { hiveQuery } from './query.js';
📝 LLM Usage Complexity: The async job model of Hive (submit → poll → fetch results) is significantly more complex than Trino's synchronous model. This PR should add clear documentation in CLAUDE.md/README.md explaining:
- When to use each tool:
  - `hive_query`: For simple SELECT queries (auto-manages the job lifecycle)
  - `hive_execute`: For write operations or when you need the job_id
  - `hive_job_status` + `hive_job_result`: For long-running jobs or recovering results
IMPORTANT: These differences should be noted in the tool descriptions so that the LLM can understand how the methods differ.
- Key differences from Trino:
  - Trino: Synchronous, immediate results
  - Hive: Asynchronous jobs, may take minutes/hours
- Usage patterns:
  // Simple query
  hive_query(sql, database)

  // Long-running job
  job_id = hive_execute(sql, database)
  status = hive_job_status(job_id)   // Poll until complete
  results = hive_job_result(job_id)  // Fetch when ready
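As a concrete illustration of that flow, here is a rough TypeScript sketch of the submit → poll → fetch pattern (the tool names come from this PR; the callTool helper, argument shapes, and status strings are assumptions for illustration only):

// Hypothetical helper that drives a long-running Hive job end to end.
// callTool() stands in for whatever MCP client invocation is available.
async function runLongHiveQuery(
  callTool: (name: string, args: object) => Promise<any>,
  sql: string,
  database?: string
): Promise<unknown> {
  // 1. Submit the job and get a job_id back
  const { job_id } = await callTool('hive_execute', { sql, database });

  // 2. Poll the job status until it reaches a terminal state
  //    ('success' / 'error' / 'killed' are assumed terminal statuses)
  for (;;) {
    const { status } = await callTool('hive_job_status', { job_id });
    if (status === 'success') break;
    if (status === 'error' || status === 'killed') {
      throw new Error(`Hive job ${job_id} finished with status ${status}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait before re-polling
  }

  // 3. Fetch the results once the job has completed
  return callTool('hive_job_result', { job_id });
}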
Without proper documentation, LLMs will struggle to choose the right tool and handle the async nature correctly.
Currently only Trino is supported.
I'd like the LLM to be able to use Hive for cases like rewriting a Trino query into a Hive query.
With this feature, we can rely on the LLM to do the rewriting, since it can execute the query and check whether it works.