Skip to content

Conversation

m-abulazm
Copy link
Contributor

@m-abulazm m-abulazm commented Aug 11, 2025

Changes

What does this PR do?

Big Picture:
Add support in lakebridge reconciler for columns with special chars e.g. Cust# as at the moment reconciler will fail if such columns are passed.

This PR:
The PR adds support for delimiting all identifiers in the SQL queries produced by the different QueryBuilders
This is necessary in case a column has a special character in the name

Relevant implementation details

  1. We use spark sql on top of JDBC to execute those queries. This means that the initial reading of the data from the databases is deferred to the source databases
  2. SQL is sent directly to the database in that database’s own dialect (e.g., use [Name] for SQL Server identifiers).
  3. Spark wraps the provided SQL in a subquery so it can treat results like a table, but the inner SQL is executed by the database.
  4. Inside Spark, queries are written in Spark’s ANSI-like dialect. e.g. compare hash queries

Caveats/things to watch out for when reviewing:

This is a big change that had the implementation split between 6 PRs to make reviews manageable. this cannot go to main without the rest of the PRs. this is PR 6/6

Linked issues

Resolves #1866

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: databricks labs lakebridge ...

Tests

  • manually tested
  • added unit tests
  • added integration tests

@m-abulazm m-abulazm changed the base branch from main to bug/recon/1866-extend-base-query-builder August 11, 2025 15:47
Copy link

github-actions bot commented Aug 11, 2025

✅ 18/18 passed, 1m18s total

Running from acceptance #2049

@m-abulazm m-abulazm force-pushed the bug/recon/1866-extend-base-query-builder branch from c81f00b to 44ee79e Compare August 11, 2025 16:42
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 8da596b to 31aa31f Compare August 11, 2025 16:47
@m-abulazm m-abulazm self-assigned this Aug 12, 2025
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 31aa31f to 4ccbf05 Compare August 12, 2025 08:34
@m-abulazm m-abulazm force-pushed the bug/recon/1866-extend-base-query-builder branch 2 times, most recently from 7bb888d to b6d150f Compare August 12, 2025 09:08
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 4ccbf05 to f812e9e Compare August 12, 2025 12:04
@m-abulazm m-abulazm marked this pull request as ready for review August 12, 2025 15:12
@m-abulazm m-abulazm requested a review from a team as a code owner August 12, 2025 15:12
@@ -66,7 +68,25 @@ def filter(self) -> str | None:

@property
def user_transformations(self) -> dict[str, str]:
return self._table_conf.get_transformation_dict(self._layer)
if self._table_conf.transformations:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved here instead of table as it needs to build the fallback with source delimiters

@@ -103,8 +125,7 @@ def _apply_default_transformation(
with_transform.append(alias.transform(self._default_transformer, schema, source))
return with_transform

@staticmethod
def _default_transformer(node: exp.Expression, schema: list[Schema], source: Dialect) -> exp.Expression:
def _default_transformer(self, node: exp.Expression, schema: list[Schema], source: Dialect) -> exp.Expression:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not static anymore as it needs to return source delimited

@@ -135,6 +136,10 @@ def build_column(this: exp.ExpOrStr, table_name="", quoted=False, alias=None) ->
return exp.Column(this=exp.Identifier(this=this, quoted=quoted), table=table_name)


def build_column_no_alias(this: str, table_name="") -> exp.Expression:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was added for now instead of changing the previous impl that is used in a lot of places.
a following PR

@@ -132,3 +154,20 @@ def _validate(self, field: set[str] | list[str] | None, message: str):
message = f"Exception for {self.table_conf.target_name} target table in {self.layer} layer --> {message}"
logger.error(message)
raise InvalidInputException(message)

def _build_column_with_alias(self, column: str):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required by many builders

@@ -26,7 +28,7 @@ def __init__(self, table_conf: Table, schema: list[Schema], layer: str, engine:

@property
def engine(self) -> Dialect:
return self._engine
return self._engine if self.layer == "source" else get_dialect("databricks")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required by many builders

@m-abulazm m-abulazm added feat/recon making sure that remorphed query produces the same results as original internal technical pr's not end user facing stacked PR Should be reviewed, but not merged labels Aug 13, 2025
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from d6d39a5 to 34fc882 Compare August 13, 2025 09:08
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 34fc882 to 5e9cb6f Compare August 13, 2025 09:15
@m-abulazm m-abulazm force-pushed the bug/recon/1866-extend-base-query-builder branch from b6d150f to 62ab8a6 Compare August 18, 2025 13:16
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 5e9cb6f to 9fdd001 Compare August 18, 2025 13:17
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 8e13d83 to afa6f46 Compare August 22, 2025 12:55
@m-abulazm m-abulazm changed the base branch from main to feat/recon/add-flag-for-normalization August 22, 2025 12:56
@m-abulazm m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from afa6f46 to a53985e Compare August 22, 2025 14:03
Copy link
Collaborator

@sundarshankar89 sundarshankar89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I have seen other PRs address more corner case tests.
Reviewing this PR kind of hit home the point of how Scattered Normalization logic has to be.

CC @bishwajit-db and @m-abulazm when we redesign this we need to improve our Interfaces.

@sundarshankar89 sundarshankar89 requested a review from a team August 26, 2025 04:27
Copy link
Contributor

@bishwajit-db bishwajit-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@m-abulazm m-abulazm merged commit 3322387 into feat/recon/add-flag-for-normalization Sep 1, 2025
9 checks passed
@m-abulazm m-abulazm deleted the bug/recon/1866-fix-recon-queries branch September 1, 2025 08:22
@m-abulazm m-abulazm restored the bug/recon/1866-fix-recon-queries branch September 1, 2025 08:41
m-abulazm added a commit that referenced this pull request Sep 1, 2025
@m-abulazm m-abulazm deleted the bug/recon/1866-fix-recon-queries branch September 1, 2025 08:41
m-abulazm added a commit that referenced this pull request Sep 1, 2025
## Changes
<!-- Summary of your changes that are easy to understand. Add
screenshots when necessary, they're helpful to illustrate the before and
after state -->
### What does this PR do?
**Big Picture:** 
Add support in lakebridge reconciler for columns with special chars e.g.
`Cust#` as at the moment reconciler will fail if such columns are
passed.

**This PR:**
The PR adds support for delimiting all identifiers in the SQL queries
produced by the different `QueryBuilder`s
This is necessary in case a column has a special character in the name

### Relevant implementation details
1. We use spark sql on top of JDBC to execute those queries. This means
that the initial reading of the data from the databases is deferred to
the source databases
2. SQL is sent directly to the database in that database’s own dialect
(e.g., use [Name] for SQL Server identifiers).
3. Spark wraps the provided SQL in a subquery so it can treat results
like a table, but the inner SQL is executed by the database.
4. Inside Spark, queries are written in Spark’s ANSI-like dialect. e.g.
compare hash queries


### Caveats/things to watch out for when reviewing:
This is a big change that had the implementation split between 6 PRs to
make reviews manageable. this cannot go to main without the rest of the
PRs. this is PR 6/6

### Linked issues
<!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes,
fixed, resolve, resolves, resolved. See
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword
-->

Resolves #1866 

### Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs lakebridge ...`

### Tests
<!-- How is this tested? Please see the checklist below and also
describe any other relevant tests -->

- [ ] manually tested
- [x] added unit tests
- [x] added integration tests
m-abulazm added a commit that referenced this pull request Sep 3, 2025
## Changes
<!-- Summary of your changes that are easy to understand. Add
screenshots when necessary, they're helpful to illustrate the before and
after state -->
### What does this PR do?
**Big Picture:** 
Add support in lakebridge reconciler for columns with special chars e.g.
`Cust#` as at the moment reconciler will fail if such columns are
passed.

**This PR:**
The PR adds support for delimiting all identifiers in the SQL queries
produced by the different `QueryBuilder`s
This is necessary in case a column has a special character in the name

### Relevant implementation details
1. We use spark sql on top of JDBC to execute those queries. This means
that the initial reading of the data from the databases is deferred to
the source databases
2. SQL is sent directly to the database in that database’s own dialect
(e.g., use [Name] for SQL Server identifiers).
3. Spark wraps the provided SQL in a subquery so it can treat results
like a table, but the inner SQL is executed by the database.
4. Inside Spark, queries are written in Spark’s ANSI-like dialect. e.g.
compare hash queries


### Caveats/things to watch out for when reviewing:
This is a big change that had the implementation split between 6 PRs to
make reviews manageable. this cannot go to main without the rest of the
PRs. this is PR 6/6

### Linked issues
<!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes,
fixed, resolve, resolves, resolved. See
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword
-->

Resolves #1866 

### Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs lakebridge ...`

### Tests
<!-- How is this tested? Please see the checklist below and also
describe any other relevant tests -->

- [ ] manually tested
- [x] added unit tests
- [x] added integration tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat/recon making sure that remorphed query produces the same results as original internal technical pr's not end user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: TSQL Does not accept special character in Reconciler Column Mapping
4 participants