-
Notifications
You must be signed in to change notification settings - Fork 70
Extend recon queries to use normalized column names #1923
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend recon queries to use normalized column names #1923
Conversation
✅ 18/18 passed, 1m18s total Running from acceptance #2049 |
c81f00b
to
44ee79e
Compare
8da596b
to
31aa31f
Compare
31aa31f
to
4ccbf05
Compare
7bb888d
to
b6d150f
Compare
4ccbf05
to
f812e9e
Compare
@@ -66,7 +68,25 @@ def filter(self) -> str | None: | |||
|
|||
@property | |||
def user_transformations(self) -> dict[str, str]: | |||
return self._table_conf.get_transformation_dict(self._layer) | |||
if self._table_conf.transformations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved here instead of table as it needs to build the fallback with source delimiters
@@ -103,8 +125,7 @@ def _apply_default_transformation( | |||
with_transform.append(alias.transform(self._default_transformer, schema, source)) | |||
return with_transform | |||
|
|||
@staticmethod | |||
def _default_transformer(node: exp.Expression, schema: list[Schema], source: Dialect) -> exp.Expression: | |||
def _default_transformer(self, node: exp.Expression, schema: list[Schema], source: Dialect) -> exp.Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not static anymore as it needs to return source delimited
@@ -135,6 +136,10 @@ def build_column(this: exp.ExpOrStr, table_name="", quoted=False, alias=None) -> | |||
return exp.Column(this=exp.Identifier(this=this, quoted=quoted), table=table_name) | |||
|
|||
|
|||
def build_column_no_alias(this: str, table_name="") -> exp.Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was added for now instead of changing the previous impl that is used in a lot of places.
a following PR
@@ -132,3 +154,20 @@ def _validate(self, field: set[str] | list[str] | None, message: str): | |||
message = f"Exception for {self.table_conf.target_name} target table in {self.layer} layer --> {message}" | |||
logger.error(message) | |||
raise InvalidInputException(message) | |||
|
|||
def _build_column_with_alias(self, column: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
required by many builders
@@ -26,7 +28,7 @@ def __init__(self, table_conf: Table, schema: list[Schema], layer: str, engine: | |||
|
|||
@property | |||
def engine(self) -> Dialect: | |||
return self._engine | |||
return self._engine if self.layer == "source" else get_dialect("databricks") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
required by many builders
d6d39a5
to
34fc882
Compare
34fc882
to
5e9cb6f
Compare
b6d150f
to
62ab8a6
Compare
5e9cb6f
to
9fdd001
Compare
8e13d83
to
afa6f46
Compare
afa6f46
to
a53985e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I have seen other PRs address more corner case tests.
Reviewing this PR kind of hit home the point of how Scattered Normalization logic has to be.
CC @bishwajit-db and @m-abulazm when we redesign this we need to improve our Interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
3322387
into
feat/recon/add-flag-for-normalization
## Changes <!-- Summary of your changes that are easy to understand. Add screenshots when necessary, they're helpful to illustrate the before and after state --> ### What does this PR do? **Big Picture:** Add support in lakebridge reconciler for columns with special chars e.g. `Cust#` as at the moment reconciler will fail if such columns are passed. **This PR:** The PR adds support for delimiting all identifiers in the SQL queries produced by the different `QueryBuilder`s This is necessary in case a column has a special character in the name ### Relevant implementation details 1. We use spark sql on top of JDBC to execute those queries. This means that the initial reading of the data from the databases is deferred to the source databases 2. SQL is sent directly to the database in that database’s own dialect (e.g., use [Name] for SQL Server identifiers). 3. Spark wraps the provided SQL in a subquery so it can treat results like a table, but the inner SQL is executed by the database. 4. Inside Spark, queries are written in Spark’s ANSI-like dialect. e.g. compare hash queries ### Caveats/things to watch out for when reviewing: This is a big change that had the implementation split between 6 PRs to make reviews manageable. this cannot go to main without the rest of the PRs. this is PR 6/6 ### Linked issues <!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword --> Resolves #1866 ### Functionality - [ ] added relevant user documentation - [ ] added new CLI command - [ ] modified existing command: `databricks labs lakebridge ...` ### Tests <!-- How is this tested? Please see the checklist below and also describe any other relevant tests --> - [ ] manually tested - [x] added unit tests - [x] added integration tests
## Changes <!-- Summary of your changes that are easy to understand. Add screenshots when necessary, they're helpful to illustrate the before and after state --> ### What does this PR do? **Big Picture:** Add support in lakebridge reconciler for columns with special chars e.g. `Cust#` as at the moment reconciler will fail if such columns are passed. **This PR:** The PR adds support for delimiting all identifiers in the SQL queries produced by the different `QueryBuilder`s This is necessary in case a column has a special character in the name ### Relevant implementation details 1. We use spark sql on top of JDBC to execute those queries. This means that the initial reading of the data from the databases is deferred to the source databases 2. SQL is sent directly to the database in that database’s own dialect (e.g., use [Name] for SQL Server identifiers). 3. Spark wraps the provided SQL in a subquery so it can treat results like a table, but the inner SQL is executed by the database. 4. Inside Spark, queries are written in Spark’s ANSI-like dialect. e.g. compare hash queries ### Caveats/things to watch out for when reviewing: This is a big change that had the implementation split between 6 PRs to make reviews manageable. this cannot go to main without the rest of the PRs. this is PR 6/6 ### Linked issues <!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword --> Resolves #1866 ### Functionality - [ ] added relevant user documentation - [ ] added new CLI command - [ ] modified existing command: `databricks labs lakebridge ...` ### Tests <!-- How is this tested? Please see the checklist below and also describe any other relevant tests --> - [ ] manually tested - [x] added unit tests - [x] added integration tests
Changes
What does this PR do?
Big Picture:
Add support in lakebridge reconciler for columns with special chars e.g.
Cust#
as at the moment reconciler will fail if such columns are passed.This PR:
The PR adds support for delimiting all identifiers in the SQL queries produced by the different
QueryBuilder
sThis is necessary in case a column has a special character in the name
Relevant implementation details
Caveats/things to watch out for when reviewing:
This is a big change that had the implementation split between 6 PRs to make reviews manageable. this cannot go to main without the rest of the PRs. this is PR 6/6
Linked issues
Resolves #1866
Functionality
databricks labs lakebridge ...
Tests