diff --git a/rfc/rfc-4/rfc4.md b/rfc/rfc-4/rfc4.md
new file mode 100644
index 000000000..57c962afa
--- /dev/null
+++ b/rfc/rfc-4/rfc4.md
@@ -0,0 +1,304 @@
+
+
+# RFC-[4]: XCatalogSync - Synchronize access control policies across catalogs
+
+## Proposers
+
+- @vinishjail97
+
+## Approvers
+
+- Anyone from the XTable community can approve/add feedback.
+
+## Status
+
+GH Feature Request:
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Today, numerous catalogs have emerged, each with its own specifications for creating tables, refreshing metadata, and implementing data governance rules.
+This diversity has led to increased complexity and confusion, making it challenging for users to choose the right catalog. To address this challenge, we previously proposed an [RFC](https://github.com/apache/incubator-xtable/pull/605/files) for synchronizing table format metadata across catalogs.
+
+In this RFC, we extend that vision to focus on synchronizing data governance policies. The aim is to enable policies defined in a source catalog to be seamlessly synchronized with multiple target catalogs. This approach not only simplifies multi-catalog operations but also fosters consistency and reduces the manual effort required to manage governance across a fragmented ecosystem.
+
+## Motivation
+A recent blog post in [Data Engineering Weekly](https://www.dataengineeringweekly.com/p/the-chaos-of-catalogs) highlighted the challenges of managing metadata and data governance in a fragmented ecosystem of catalogs. It emphasized the need for scalable solutions, such as adopting a federated catalog model, to address the operational friction caused by the coexistence of multiple catalogs.
+
+## Background
+An access control policy defines a rule stating, "A principal has specific privileges for a securable object."
+In the context of data catalogs, these privileges can include actions like SELECT or CREATE statements used in DDL, DML, or DQL queries; the securable objects can range from databases and tables to columns and beyond. When a catalog is connected to a query engine, it enforces these permissions for the principal (user) either directly or by issuing temporary credentials that the query engine can use to execute queries securely.
+
+In today’s data ecosystem, numerous catalogs exist, each with its own specifications and methods for enforcing access control policies. Some catalogs, like AWS Glue and BigLake, are tightly integrated within their ecosystems, while others rely on credential-sharing approaches to support multiple query engines. Similar to how we have defined [InternalTable](https://github.com/apache/incubator-xtable/blob/main/xtable-api/src/main/java/org/apache/xtable/model/InternalTable.java), we aim to establish a canonical representation for access control policies and synchronize these policies across different catalogs.
+
+## Implementation
+After reviewing the specifications of multiple catalogs (HMS, AWS Glue LakeFormation, Unity, Polaris, etc.), we observed that most follow a similar conceptual model for access control, incorporating roles, users, user-groups, privileges, and securable objects. While there are slight variations in naming and nuances, these foundational concepts align closely with the design principles originally established by HMS.
+
+For example, HMS defines enums for [PrivilegeType](https://learn.microsoft.com/en-us/azure/databricks/data-governance/table-acls/object-privileges#privilege-types) (SELECT, CREATE, MODIFY, USAGE, READ_METADATA, CREATE_NAMED_FUNCTION, MODIFY_CLASSPATH, ALL PRIVILEGES) and [SecurableObjectType](https://learn.microsoft.com/en-us/azure/databricks/data-governance/table-acls/object-privileges#securable-objects) (CATALOG, SCHEMA, TABLE, VIEW, FUNCTION), which form the basis for hierarchical asset management like Catalog ➝ Schema ➝ Table. Many other catalogs, such as Glue LakeFormation, Polaris, and Unity, extend or adapt this approach to fit their specific ecosystems, often expanding the scope of securable objects. Interestingly, these catalogs also share similar [authorization APIs](https://github.com/apache/hive/blame/master/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L3051) originally derived from HMS's thrift spec, enabling operations like granting and listing permissions.
+
+Below is the first version of the internal models that allow us to interoperate and synchronize across multiple catalogs. It is not final, and we can improve it as we add implementations for source and target catalogs.
+
+**InternalPrivilege**
+```
+/**
+ * Represents a single privilege assignment for a securable object.
+ *
+ * This defines the kind of operation (e.g., SELECT, CREATE, MODIFY) and whether it is allowed or
+ * denied. Some catalogs may only accept ALLOW rules and treat all other operations as denied by
+ * default.
+ */
+public class InternalPrivilege {
+  /**
+   * The type of privilege, such as SELECT, CREATE, or MODIFY. Each implementation can define its
+   * own set of enums.
+   */
+  InternalPrivilegeType privilegeType;
+
+  /**
+   * The decision, typically ALLOW or DENY. Some catalogs may not support DENY explicitly,
+   * defaulting to ALLOW.
+   */
+  String privilegeDecision;
+}
+```
+
+**InternalSecurableObject**
+```
+/**
+ * Represents a securable object in the catalog, which can be managed by access control.
+ *
+ * Examples of securable objects include catalogs, schemas, tables, views, or any other data
+ * objects that require fine-grained privilege management. Each securable object can have one or
+ * more privileges assigned to it.
+ */
+public class InternalSecurableObject {
+  /** The identifier of the securable object. */
+  InternalSecurableObjectIdentifier securableObjectIdentifier;
+  /**
+   * The type of securable object, such as TABLE, VIEW, FUNCTION, etc. Each implementation can
+   * define its own set of enums.
+   */
+  InternalSecurableObjectType securableObjectType;
+  /** The set of privileges assigned to this object. */
+  List<InternalPrivilege> privileges;
+}
+```
+
+**InternalChangeLogInfo**
+```
+/**
+ * This class is useful for governance and compliance scenarios, where an audit trail is
+ * necessary. It can be extended to include additional fields such as reasonForChange or
+ * changeDescription.
+ */
+public class InternalChangeLogInfo {
+  /** The username or identifier of the entity that created this record. */
+  String createdBy;
+
+  /** The username or identifier of the entity that last modified this record. */
+  String lastModifiedBy;
+
+  /** The timestamp when this record was created. */
+  Instant createdAt;
+
+  /** The timestamp when this record was last modified. */
+  Instant lastModifiedAt;
+}
+```
+
+
+**InternalRole**
+```
+/**
+ * Represents a role within the catalog.
+ *
+ * A role can be granted access to multiple securable objects, each with its own set of
+ * privileges. Audit info is stored to track the role's creation and modifications, and a properties
+ * map can hold additional metadata.
+ */
+public class InternalRole {
+  /** The unique name or identifier for the role. */
+  String name;
+
+  /** The list of securable objects this role can access. */
+  List<InternalSecurableObject> securableObjects;
+}
+```
+
+**InternalUser**
+```
+/**
+ * A user may be assigned multiple roles and can also belong to a specific user group. Audit
+ * information is stored to allow tracking of who created or last modified the user.
+ */
+public class InternalUser {
+  /** The unique name or identifier for the user. */
+  String name;
+
+  /** The list of roles assigned to this user. */
+  List<InternalRole> roles;
+}
+```
+
+**InternalUserGroup**
+```
+/**
+ * Groups can have multiple roles assigned, and also include audit information to track creation
+ * and modifications.
+ */
+public class InternalUserGroup {
+  /** The unique name or identifier for the user group. */
+  String name;
+
+  /** The list of roles assigned to this group. */
+  List<InternalRole> roles;
+}
+```
+
+**InternalAccessControlPolicySnapshot**
+```
+public class InternalAccessControlPolicySnapshot {
+  /**
+   * This could be a UUID, timestamp string, or any value that guarantees uniqueness across
+   * snapshots.
+   */
+  String versionId;
+
+  /**
+   * The moment in time when this snapshot was created.
+   *
+   * Useful for maintaining an audit trail or comparing how policies have changed over time.
+   */
+  Instant timestamp;
+
+  /**
+   * A map of user names to {@link InternalUser} objects, capturing individual users' details such
+   * as assigned roles, auditing metadata, etc.
+   */
+  @Builder.Default Map<String, InternalUser> usersByName = new HashMap<>();
+}
+```
+
+**CatalogAccessControlPolicySyncClient**
+```
+/**
+ * Implementations of this interface are responsible for:
+ */
+public interface CatalogAccessControlPolicySyncClient {
+  /**
+   * Fetches the current policies from the catalog, converting them into the internal canonical
+   * model.
+   *
+   * This method allows you to pull in the catalog’s native policy definitions (e.g., roles,
+   * privileges, user/groups) and map them into a {@link InternalAccessControlPolicySnapshot} so
+   * that they can be managed or merged with your centralized policy framework.
+   *
+   * @return An {@code InternalAccessControlPolicySnapshot} containing the catalog’s current policies.
+   */
+  InternalAccessControlPolicySnapshot fetchPolicies();
+
+  /**
+   * Pushes the canonical policy snapshot into the target catalog, converting it into the catalog’s
+   * native policy definitions and applying any necessary updates.
+   *
+   * This method typically performs the following steps:
+   *
+   * @param snapshot The access control policy snapshot to be synchronized with the catalog.
+   */
+  void pushPolicies(InternalAccessControlPolicySnapshot snapshot);
+}
+```
+
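As a rough illustration of how the interface above could be exercised, the following sketch syncs policies one way from a source catalog to a target. Everything in it is an assumption made for illustration: `Snapshot` is a flattened stand-in for `InternalAccessControlPolicySnapshot` (role name mapped to privilege strings), and `SyncClient`, `InMemoryCatalogClient`, and `sync` are hypothetical names that do not appear in this RFC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PolicySyncSketch {
  /** Simplified stand-in for InternalAccessControlPolicySnapshot: role name -> privileges. */
  static final class Snapshot {
    final String versionId;
    final Map<String, List<String>> privilegesByRole;

    Snapshot(String versionId, Map<String, List<String>> privilegesByRole) {
      this.versionId = versionId;
      this.privilegesByRole = privilegesByRole;
    }
  }

  /** Mirrors the two methods of CatalogAccessControlPolicySyncClient. */
  interface SyncClient {
    Snapshot fetchPolicies();

    void pushPolicies(Snapshot snapshot);
  }

  /** Hypothetical in-memory catalog; a real client would call the catalog's own grant APIs. */
  static final class InMemoryCatalogClient implements SyncClient {
    private final Map<String, List<String>> grants = new TreeMap<>();
    private int version = 0;

    @Override
    public Snapshot fetchPolicies() {
      // Return a defensive copy so callers cannot mutate the catalog's state.
      Map<String, List<String>> copy = new TreeMap<>();
      grants.forEach((role, privileges) -> copy.put(role, new ArrayList<>(privileges)));
      return new Snapshot("v" + version, copy);
    }

    @Override
    public void pushPolicies(Snapshot snapshot) {
      // Replace-all semantics keep the sketch short; a real client would likely diff and update.
      grants.clear();
      snapshot.privilegesByRole.forEach(
          (role, privileges) -> grants.put(role, new ArrayList<>(privileges)));
      version++;
    }
  }

  /** One-way sync: read the source once, then push the same snapshot to every target. */
  static void sync(SyncClient source, List<SyncClient> targets) {
    Snapshot snapshot = source.fetchPolicies();
    for (SyncClient target : targets) {
      target.pushPolicies(snapshot);
    }
  }

  public static void main(String[] args) {
    InMemoryCatalogClient source = new InMemoryCatalogClient();
    Map<String, List<String>> seed = new TreeMap<>();
    seed.put("analyst", List.of("SELECT on sales.orders"));
    source.pushPolicies(new Snapshot("seed", seed));

    InMemoryCatalogClient target = new InMemoryCatalogClient();
    sync(source, List.of(target));

    // Prints {analyst=[SELECT on sales.orders]}
    System.out.println(target.fetchPolicies().privilegesByRole);
  }
}
```

Fetching from the target after a push and comparing it with the source snapshot, as `main` does, is one simple way to check a sync round trip.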
+
+## Rollout/Adoption Plan
+
+- Are there any breaking changes as part of this new feature/functionality?
+  - None. This is new functionality that provides access control policy synchronization across catalogs.
+- What impact (if any) will there be on existing users?
+  - N/A
+- If we are changing behavior, how will we phase out the older behavior? When will we remove the existing behavior?
+  - N/A
+- If we need special migration tools, describe them here.
+  - N/A
+
+## Test Plan
+
+Based on community feedback, we will determine the initial set of catalogs to support. Two-way policy synchronization will then be validated for these catalogs to ensure functionality and reliability.
\ No newline at end of file