-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat(native): Support custom schemas in native sidecar function registry #26236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's GuideImplements catalog-scoped function metadata retrieval and registration by extending the native sidecar API, HTTP endpoint, client provider, connector integration for Hive functions, OpenAPI spec, and corresponding tests to support custom schemas. Class diagram for NativeFunctionDefinitionProvider and catalog-scoped retrievalclassDiagram
class NativeFunctionDefinitionProvider {
- JsonCodec nativeFunctionSignatureMapJsonCodec
- HttpClient httpClient
- NativeFunctionNamespaceManagerConfig config
- String catalogName
+ NativeFunctionDefinitionProvider(..., String catalogName)
+ UdfFunctionSignatureMap getUdfDefinition(NodeManager)
}
class UdfFunctionSignatureMap {
+ UdfFunctionSignatureMap(Map<String, List<JsonBasedUdfFunctionMetadata>>)
}
NativeFunctionDefinitionProvider --> UdfFunctionSignatureMap
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
1c9bf83
to
8ace7d7
Compare
8ace7d7
to
3767a71
Compare
@sourcery-ai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The new GET /v1/functions/{catalog} handler always returns 200 with null or empty JSON even for non‐existent catalogs—modify it to return a 404 when no functions are found to match the OpenAPI spec.
- The
/v1/functions/{catalog}
path conflicts with the existing/v1/functions/{schema}
route—consider renaming or ordering the handlers to avoid ambiguous routing. - The hard-coded blocklist in getFunctionsMetadata is scattered in the implementation—extract it into a shared constant or make it configurable to simplify future updates.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new GET /v1/functions/{catalog} handler always returns 200 with null or empty JSON even for non‐existent catalogs—modify it to return a 404 when no functions are found to match the OpenAPI spec.
- The `/v1/functions/{catalog}` path conflicts with the existing `/v1/functions/{schema}` route—consider renaming or ordering the handlers to avoid ambiguous routing.
- The hard-coded blocklist in getFunctionsMetadata is scattered in the implementation—extract it into a shared constant or make it configurable to simplify future updates.
## Individual Comments
### Comment 1
<location> `presto-native-execution/presto_cpp/main/functions/FunctionMetadata.cpp:341` </location>
<code_context>
+ continue;
+ }
+
+ const auto parts = getFunctionNameParts(name);
+ if (parts[0] != catalog) {
+ continue;
</code_context>
<issue_to_address>
**issue:** Potential out-of-bounds access in 'parts' array.
Validate that 'parts' contains at least three elements before accessing indices 1 and 2 to prevent undefined behavior.
</issue_to_address>
### Comment 2
<location> `presto-native-execution/presto_cpp/main/connectors/PrestoToVeloxConnector.cpp:63-65` </location>
<code_context>
+
+ // Register hive-specific functions when hive catalog is detected.
+ // Delegate to generic Hive native function registrar which is idempotent.
+ if (connectorName ==
+ velox::connector::hive::HiveConnectorFactory::kHiveConnectorName ||
+ connectorName == std::string("hive-hadoop2")) {
+ hive::functions::registerHiveNativeFunctions();
+ }
</code_context>
<issue_to_address>
**suggestion:** Connector name comparison may be brittle.
Hardcoding connector names increases maintenance risk if new variants are added. Centralize connector name definitions or use a more robust matching approach.
Suggested implementation:
```cpp
// Register hive-specific functions when hive catalog is detected.
// Delegate to generic Hive native function registrar which is idempotent.
if (isHiveConnector(connectorName)) {
hive::functions::registerHiveNativeFunctions();
}
```
```cpp
#include "velox/functions/FunctionRegistry.h"
#include <unordered_set>
```
```cpp
connectorName);
protocol::registerConnectorProtocol(
connectorName, std::move(connectorProtocol));
// Centralized Hive connector name definitions.
namespace {
const std::unordered_set<std::string> kHiveConnectorNames = {
velox::connector::hive::HiveConnectorFactory::kHiveConnectorName,
"hive-hadoop2"
};
bool isHiveConnector(const std::string& connectorName) {
return kHiveConnectorNames.count(connectorName) > 0;
}
} // namespace
```
</issue_to_address>
### Comment 3
<location> `presto-native-execution/presto_cpp/main/connectors/hive/functions/InitcapFunction.h:30` </location>
<code_context>
+struct InitCapFunction {
+ VELOX_DEFINE_FUNCTION_TYPES(T);
+
+ static constexpr bool is_default_ascii_behavior = true;
+
+ FOLLY_ALWAYS_INLINE void call(
</code_context>
<issue_to_address>
**nitpick:** Unused 'is_default_ascii_behavior' constant.
If this constant is not needed, please remove it. If it is reserved for future use or external access, add a comment explaining its purpose.
</issue_to_address>
### Comment 4
<location> `presto-native-execution/presto_cpp/main/functions/tests/FunctionMetadataTest.cpp:117` </location>
<code_context>
testFunction("variance", "Variance.json", 5);
}
+
+TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithCatalog) {
+ // Test with the "presto" catalog that is registered in SetUpTestSuite
+ std::string catalog = "presto";
</code_context>
<issue_to_address>
**suggestion (testing):** Missing test coverage for custom schemas and non-default schemas.
Please add tests for functions registered under non-default schemas to ensure getFunctionsMetadata returns correct metadata for those cases.
Suggested implementation:
```cpp
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithCatalog) {
// Test with the "presto" catalog that is registered in SetUpTestSuite
std::string catalog = "presto";
auto metadata = getFunctionsMetadata(catalog);
// The result should be a JSON object with function names as keys
ASSERT_TRUE(metadata.is_object());
ASSERT_FALSE(metadata.empty());
// Verify that common functions are present
ASSERT_TRUE(metadata.contains("abs"));
ASSERT_TRUE(metadata.contains("mod"));
}
// Register a function under a custom schema for testing purposes.
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithCustomSchema) {
// Assume registerFunction is available for registering test functions.
// Register a function "custom_func" under schema "custom_schema".
std::string customCatalog = "presto";
std::string customSchema = "custom_schema";
std::string functionName = "custom_func";
// The registration API may differ; adjust as needed for your codebase.
registerFunction(customCatalog, customSchema, functionName, /*function implementation*/ nullptr);
auto metadata = getFunctionsMetadata(customCatalog);
// The result should include the custom function under the custom schema.
ASSERT_TRUE(metadata.is_object());
ASSERT_TRUE(metadata.contains(functionName));
// Optionally, check that the schema is correct in the metadata.
ASSERT_EQ(metadata[functionName]["schema"], customSchema);
}
// Register a function under another non-default schema for additional coverage.
TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithNonDefaultSchema) {
std::string catalog = "presto";
std::string nonDefaultSchema = "analytics";
std::string functionName = "analytics_func";
registerFunction(catalog, nonDefaultSchema, functionName, /*function implementation*/ nullptr);
auto metadata = getFunctionsMetadata(catalog);
ASSERT_TRUE(metadata.is_object());
ASSERT_TRUE(metadata.contains(functionName));
ASSERT_EQ(metadata[functionName]["schema"], nonDefaultSchema);
}
```
- Ensure that the `registerFunction` API exists and is accessible in your test environment. If not, you may need to use the actual function registration mechanism used in your codebase.
- If the metadata structure differs (e.g., schema is not a direct property), adjust the assertions accordingly.
- If setup/teardown is required for custom functions, add appropriate cleanup code.
</issue_to_address>
### Comment 5
<location> `presto-native-execution/presto_cpp/main/functions/tests/FunctionMetadataTest.cpp:150` </location>
<code_context>
+ }
+}
+
+TEST_F(FunctionMetadataTest, GetFunctionsMetadataWithNonExistentCatalog) {
+ // Test with a catalog that doesn't exist
+ std::string catalog = "nonexistent";
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding a test for error conditions or malformed catalog names.
Please include test cases for malformed catalog names, such as empty strings or special characters, to verify robust error handling.
</issue_to_address>
### Comment 6
<location> `presto-native-execution/presto_cpp/main/functions/tests/FunctionMetadataTest.cpp:143` </location>
<code_context>
+ ASSERT_TRUE(signature.contains("functionKind"))
+ << "Function: " << it.key();
+
+ // Schema should be "default" since we registered with "presto.default."
+ // prefix
+ EXPECT_EQ(signature["schema"], "default") << "Function: " << it.key();
</code_context>
<issue_to_address>
**nitpick (testing):** Test assertions could be more robust for schema values.
Hardcoding 'default' for the schema may cause the test to fail if registration logic changes or custom schemas are used. Parameterize the expected schema or derive it from the registration logic to improve test resilience.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
continue; | ||
} | ||
|
||
const auto parts = getFunctionNameParts(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Potential out-of-bounds access in 'parts' array.
Validate that 'parts' contains at least three elements before accessing indices 1 and 2 to prevent undefined behavior.
struct InitCapFunction { | ||
VELOX_DEFINE_FUNCTION_TYPES(T); | ||
|
||
static constexpr bool is_default_ascii_behavior = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: Unused 'is_default_ascii_behavior' constant.
If this constant is not needed, please remove it. If it is reserved for future use or external access, add a comment explaining its purpose.
ASSERT_TRUE(signature.contains("functionKind")) | ||
<< "Function: " << it.key(); | ||
|
||
// Schema should be "default" since we registered with "presto.default." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick (testing): Test assertions could be more robust for schema values.
Hardcoding 'default' for the schema may cause the test to fail if registration logic changes or custom schemas are used. Parameterize the expected schema or derive it from the registration logic to improve test resilience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Joe-Abraham ,
From a high-level, the changes look good to me. Will do a more thorough analysis later.
Can you write a small RFC so we can get more feedback on the architecture?
Description
Fixes
Motivation and Context
Impact
Test Plan
Contributor checklist