Skip to content

Conversation

mickeey2525
Copy link

@mickeey2525 mickeey2525 commented Sep 7, 2025

currently only Trino is supported.

I'd like to have LLM use hive for a case like rewrite Trino query to Hive query.
With this feature, we can rely on LLM to rewriting as it can execute the query and check if it will work or not.

@codecov-commenter
Copy link

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds Hive query execution support to the MCP server, enabling LLMs to rewrite and validate Trino queries against Hive. The implementation provides four new Hive-specific tools that complement the existing Trino functionality.

  • Implements a complete Hive client using Treasure Data's v3 REST API
  • Adds four Hive tools: query execution, job management, and result retrieval
  • Integrates existing security validation and audit logging systems

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/client/hive.ts New TDHiveClient implementing REST API communication with job management and query execution
src/client/tdapi/endpoints.ts Centralized TD API endpoint configuration for different sites
src/tools/hive/*.ts Four new Hive tools providing query, execute, job-status, and job-result functionality
src/server.ts Integration of Hive tools into the MCP server's tool registry and handlers
src/security/audit-logger.ts Fixed console logging to use stderr instead of stdout for MCP compatibility
tests/**/*.test.ts Comprehensive test coverage for all new Hive functionality

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@mickeey2525 mickeey2525 requested a review from Copilot September 18, 2025 09:05
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated no new comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@yuokada yuokada requested a review from xerial September 25, 2025 09:00
@xerial xerial changed the title add hive query tool DEVAI-346: Add hive query tool Sep 26, 2025
Copy link
Member

@xerial xerial left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

Thank you for adding Hive query functionality! The implementation is comprehensive with good security practices and test coverage. Below are my findings:

✅ Strengths:

  • Comprehensive implementation with query, execute, job status, and result operations
  • Good test coverage with proper mocking
  • Secure by design with SQL validation and audit logging
  • Proper API key masking in errors
  • Well-structured code organization under src/tools/hive/

🔴 Critical Issues to Fix:

  1. Unrelated change in audit-logger.ts - Revert the console.log → console.error change
  2. Missing documentation - Add Hive integration details to CLAUDE.md

🟡 Recommendations:

  1. Consider refactoring to class-based pattern for consistency with existing Presto tools
  2. Add error logging in getJobErrorDetails for debugging
  3. Consider adding integration tests (can be follow-up PR)

I'll add specific inline comments for the issues found.

const rowCount = entry.rowCount !== undefined ? ` -> ${entry.rowCount} rows` : '';

console.log(
console.error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical Issue: This change from console.log to console.error is unrelated to the Hive feature and breaks consistency with the existing audit logging pattern. Please revert this change.

properties: {},
},
},
// Hive Tools
Copy link
Member

@xerial xerial Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Documentation Needed: This PR should also update README/CLAUDE.md to document the new Hive integration, similar to the existing CDP and Workflow sections. The documentation should include:

  • Overview of Hive integration and when to use it vs Presto
  • Available tools and their usage
  • Authentication details
  • Example usage patterns

private async getJobErrorDetails(jobId: string): Promise<string | undefined> {
try {
const show = await this.request<JobShowResponse>(
'GET',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: Consider adding debug logging here when errors occur while fetching job details. This would help with troubleshooting issues in production:

} catch (e) {
  if (process.env.TD_MCP_LOG_TO_CONSOLE === 'true') {
    console.error('[Hive] Failed to get job error details:', e);
  }
  return undefined;
}

inputSchema: {
type: 'object',
properties: {
sql: { type: 'string', description: 'Hive SQL query (SELECT, SHOW, DESCRIBE only).' },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's note the use of Hive SQL dialect and Treasure Data UDFs, so that LLM can generate TD-Hive-specific queries.

limit: z.number().int().min(1).max(10000).optional().describe('Max rows to return (default 40)')
});

export const hiveQuery = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Architecture Consideration: The Hive tools export simple objects with handler functions, while the existing Presto QueryTool uses a class-based approach. Consider refactoring to match the class pattern for better consistency with the existing codebase architecture.

However, I notice the CDP and Workflow tools also use the simple object pattern, so this may be acceptable. Just worth noting the inconsistency between different tool types.

Copy link
Member

@xerial xerial Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ Is a review from Claude Code.

I think using top-level function is better for TypeScript, so no need to use class here. We need to revise existing code to use top-level functions


const config = loadConfig();
const client = new TDHiveClient({ ...config, database: database || config.database });
const validator = new QueryValidator(false); // disallow writes regardless of env flag for this tool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Security Practice: Excellent implementation here - hive_query hardcodes read-only validation regardless of the config flag, ensuring this tool can never perform write operations. This is a solid security pattern.


expect(consoleLogSpy).toHaveBeenCalled();
const logMessage = consoleLogSpy.mock.calls[0][0] as string;
expect(consoleErrorSpy).toHaveBeenCalled();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Test Change: These test modifications appear to be unrelated to the Hive feature. The changes from consoleLogSpy to consoleErrorSpy should be reverted along with the audit-logger.ts change to maintain consistency.

handler: async (args: unknown) => {
const { sql, database, limit } = inputSchema.parse(args);

const config = loadConfig();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Database Context Pattern Issue: The database parameter handling here doesn't follow the established pattern used by Trino tools.

Current implementation creates a new client with the database baked into the config:

const client = new TDHiveClient({ ...config, database: database || config.database });

This should follow the Trino pattern of using a useDatabase context method instead:

const client = new TDHiveClient(config);
if (database) {
  client.useDatabase(database);
}

This inconsistency will confuse LLMs about how to properly set database context across different tools.

},
handler: async (args: unknown): Promise<HiveExecuteWriteResult | HiveQueryResult> => {
const { sql, database, priority, retry_limit, pool_name, limit } = inputSchema.parse(args);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Same Database Pattern Issue: This also creates a new client with database in config rather than using a useDatabase pattern. Should be consistent with Trino tools.


/**
* Minimal REST client for Treasure Data Hive jobs (v3 API)
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing useDatabase Method: The TDHiveClient should implement a useDatabase method to match the pattern used by TDTrinoClient. This would ensure consistent database context handling across all query tools.

Suggested implementation:

export class TDHiveClient {
  private readonly apiKey: string;
  private readonly baseUrl: string;
  private currentDatabase: string;  // Add this
  readonly database: string;

  constructor(config: Config) {
    this.apiKey = config.td_api_key;
    this.baseUrl = getTdApiEndpointForSite(config.site);
    this.database = config.database || 'sample_datasets';
    this.currentDatabase = this.database;  // Initialize
  }

  // Add this method
  useDatabase(database: string): this {
    this.currentDatabase = database;
    return this;
  }

  async issueHive(query: string, options?: HiveIssueJobOptions): Promise<{ job_id: string }> {
    // Use this.currentDatabase instead of database parameter
    const db = this.currentDatabase;
    // ... rest of implementation
  }
}

This would allow tools to use: client.useDatabase(database) instead of passing database through config, matching the Trino pattern.

@@ -0,0 +1,5 @@
export { hiveQuery } from './query.js';
Copy link
Member

@xerial xerial Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 LLM Usage Complexity: The async job model of Hive (submit → poll → fetch results) is significantly more complex than Trino's synchronous model. This PR should add clear documentation in CLAUDE.md/README.md explaining:

  1. When to use each tool:
    • hive_query: For simple SELECT queries (auto-manages job lifecycle)
    • hive_execute: For write operations or when you need the job_id
    • hive_job_status + hive_job_result: For long-running jobs or recovering results

IMPORTANT These differences should be noted in the function description so that LLM can understand the method differences.

  1. Key differences from Trino:

    • Trino: Synchronous, immediate results
    • Hive: Asynchronous jobs, may take minutes/hours
  2. Usage patterns:

    // Simple query
    hive_query(sql, database)
    
    // Long-running job
    job_id = hive_execute(sql, database)
    status = hive_job_status(job_id)  // Poll until complete
    results = hive_job_result(job_id)  // Fetch when ready
    

Without proper documentation, LLMs will struggle to choose the right tool and handle the async nature correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants