Extend recon queries to use normalized column names #1923

m-abulazm · 2025-08-11T15:47:03Z

Changes

What does this PR do?

Big Picture:
Add support in the lakebridge reconciler for columns with special characters, e.g. `Cust#`. At the moment the reconciler fails if such columns are passed.

This PR:
This PR adds support for delimiting all identifiers in the SQL queries produced by the different `QueryBuilder`s.
This is necessary whenever a column name contains a special character.

Relevant implementation details

1. We use Spark SQL on top of JDBC to execute those queries. This means that the initial read of the data is deferred to the source databases.
2. SQL is sent directly to the database in that database’s own dialect (e.g., use `[Name]` for SQL Server identifiers).
3. Spark wraps the provided SQL in a subquery so it can treat the results like a table, but the inner SQL is executed by the database.
4. Inside Spark, queries are written in Spark’s ANSI-like dialect, e.g. the compare hash queries.
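
The split between source-dialect SQL and Spark-side SQL described above can be illustrated with a minimal, standard-library-only sketch. The `DELIMITERS` table and the `delimit`, `source_query`, and `spark_side_query` helpers are hypothetical names invented for this illustration; the PR itself builds expressions with sqlglot rather than with string formatting.

```python
# Hypothetical delimiter table for illustration only; not part of lakebridge.
DELIMITERS = {
    "tsql": ("[", "]"),
    "snowflake": ('"', '"'),
    "oracle": ('"', '"'),
    "databricks": ("`", "`"),
}


def delimit(name: str, dialect: str) -> str:
    """Wrap an identifier in the dialect's delimiters, doubling the closing one to escape it."""
    start, end = DELIMITERS[dialect]
    return f"{start}{name.replace(end, end * 2)}{end}"


def source_query(columns: list[str], table: str, dialect: str) -> str:
    """SQL pushed down over JDBC, so identifiers follow the source dialect."""
    cols = ", ".join(delimit(c, dialect) for c in columns)
    return f"SELECT {cols} FROM {table}"


def spark_side_query(columns: list[str], view: str) -> str:
    """SQL that Spark itself runs, so identifiers use backticks."""
    cols = ", ".join(delimit(c, "databricks") for c in columns)
    return f"SELECT {cols} FROM {view}"


print(source_query(["Cust#", "Name"], "dbo.customers", "tsql"))
print(spark_side_query(["Cust#", "Name"], "source_view"))
```

The same column, `Cust#`, ends up as `[Cust#]` in the query sent to SQL Server and as a backtick-delimited name in the query Spark evaluates, which is why the builders need to know which side a query runs on.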

Caveats/things to watch out for when reviewing:

This is a big change whose implementation was split across 6 PRs to make reviews manageable. It cannot go to main without the rest of the PRs. This is PR 6/6.

Linked issues

Resolves #1866

Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs lakebridge ...`

Tests

- [ ] manually tested
- [x] added unit tests
- [x] added integration tests

github-actions · 2025-08-11T15:54:33Z

✅ 18/18 passed, 1m18s total

Running from acceptance #2049

m-abulazm · 2025-08-13T08:38:08Z

src/databricks/labs/lakebridge/reconcile/query_builder/base.py

@@ -66,7 +68,25 @@ def filter(self) -> str | None:

    @property
    def user_transformations(self) -> dict[str, str]:
-        return self._table_conf.get_transformation_dict(self._layer)
+        if self._table_conf.transformations:


Moved here instead of `Table`, as it needs to build the fallback with source delimiters.

m-abulazm · 2025-08-13T08:39:28Z

src/databricks/labs/lakebridge/reconcile/query_builder/base.py

@@ -103,8 +125,7 @@ def _apply_default_transformation(
            with_transform.append(alias.transform(self._default_transformer, schema, source))
        return with_transform

-    @staticmethod
-    def _default_transformer(node: exp.Expression, schema: list[Schema], source: Dialect) -> exp.Expression:
+    def _default_transformer(self, node: exp.Expression, schema: list[Schema], source: Dialect) -> exp.Expression:


No longer static, as it needs to return source-delimited identifiers.

m-abulazm · 2025-08-13T08:43:23Z

src/databricks/labs/lakebridge/reconcile/query_builder/expression_generator.py

@@ -135,6 +136,10 @@ def build_column(this: exp.ExpOrStr, table_name="", quoted=False, alias=None) ->
    return exp.Column(this=exp.Identifier(this=this, quoted=quoted), table=table_name)


+def build_column_no_alias(this: str, table_name="") -> exp.Expression:


This was added for now instead of changing the previous implementation, which is used in a lot of places. Changing it is left for a following PR.

m-abulazm · 2025-08-13T08:44:11Z

src/databricks/labs/lakebridge/reconcile/query_builder/base.py

@@ -132,3 +154,20 @@ def _validate(self, field: set[str] | list[str] | None, message: str):
            message = f"Exception for {self.table_conf.target_name} target table in {self.layer} layer --> {message}"
            logger.error(message)
            raise InvalidInputException(message)
+
+    def _build_column_with_alias(self, column: str):


required by many builders

m-abulazm · 2025-08-13T08:44:24Z

src/databricks/labs/lakebridge/reconcile/query_builder/base.py

@@ -26,7 +28,7 @@ def __init__(self, table_conf: Table, schema: list[Schema], layer: str, engine:

    @property
    def engine(self) -> Dialect:
-        return self._engine
+        return self._engine if self.layer == "source" else get_dialect("databricks")


required by many builders
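
The layer-dependent dialect selection in the diff above can be sketched in isolation as follows. `QueryBuilderSketch` and its `source_engine` field are hypothetical stand-ins, and dialects are represented as plain strings here rather than the `Dialect` objects the real code obtains via `get_dialect`.

```python
from dataclasses import dataclass


@dataclass
class QueryBuilderSketch:
    """Hypothetical stand-in for the query builder base class."""

    layer: str          # "source" or "target"
    source_engine: str  # dialect name of the source system, e.g. "tsql"

    @property
    def engine(self) -> str:
        # Source-layer SQL is executed by the source database;
        # any other layer is executed on Databricks.
        return self.source_engine if self.layer == "source" else "databricks"


print(QueryBuilderSketch(layer="source", source_engine="tsql").engine)
print(QueryBuilderSketch(layer="target", source_engine="tsql").engine)
```

Resolving the dialect in one property means each builder can ask for `engine` without repeating the layer check.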

sundarshankar89

LGTM. I have seen the other PRs address more corner-case tests.
Reviewing this PR really drove home how scattered the normalization logic has to be.

CC @bishwajit-db and @m-abulazm: when we redesign this, we need to improve our interfaces.

bishwajit-db

LGTM

This reverts commit 3322387.


m-abulazm had a problem deploying to tool August 11, 2025 15:47 — with GitHub Actions Failure

m-abulazm changed the base branch from main to bug/recon/1866-extend-base-query-builder August 11, 2025 15:47

m-abulazm force-pushed the bug/recon/1866-extend-base-query-builder branch from c81f00b to 44ee79e Compare August 11, 2025 16:42

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 8da596b to 31aa31f Compare August 11, 2025 16:47

m-abulazm had a problem deploying to tool August 11, 2025 16:47 — with GitHub Actions Failure

m-abulazm self-assigned this Aug 12, 2025

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 31aa31f to 4ccbf05 Compare August 12, 2025 08:34

m-abulazm had a problem deploying to tool August 12, 2025 08:34 — with GitHub Actions Failure

m-abulazm force-pushed the bug/recon/1866-extend-base-query-builder branch 2 times, most recently from 7bb888d to b6d150f Compare August 12, 2025 09:08

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 4ccbf05 to f812e9e Compare August 12, 2025 12:04

m-abulazm temporarily deployed to tool August 12, 2025 12:04 — with GitHub Actions Inactive

m-abulazm temporarily deployed to tool August 12, 2025 13:43 — with GitHub Actions Inactive

m-abulazm temporarily deployed to tool August 12, 2025 15:11 — with GitHub Actions Inactive

m-abulazm marked this pull request as ready for review August 12, 2025 15:12

m-abulazm requested a review from a team as a code owner August 12, 2025 15:12

m-abulazm temporarily deployed to tool August 12, 2025 15:43 — with GitHub Actions Inactive

m-abulazm temporarily deployed to tool August 13, 2025 08:56 — with GitHub Actions Inactive

m-abulazm commented Aug 13, 2025


m-abulazm added labels Aug 13, 2025: feat/recon (making sure that remorphed query produces the same results as original), internal (technical pr's not end user facing), stacked PR (Should be reviewed, but not merged)

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from d6d39a5 to 34fc882 Compare August 13, 2025 09:08

m-abulazm temporarily deployed to tool August 13, 2025 09:08 — with GitHub Actions Inactive

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 34fc882 to 5e9cb6f Compare August 13, 2025 09:15

m-abulazm temporarily deployed to tool August 13, 2025 09:15 — with GitHub Actions Inactive

m-abulazm force-pushed the bug/recon/1866-extend-base-query-builder branch from b6d150f to 62ab8a6 Compare August 18, 2025 13:16

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 5e9cb6f to 9fdd001 Compare August 18, 2025 13:17

m-abulazm temporarily deployed to tool August 18, 2025 13:17 — with GitHub Actions Inactive

m-abulazm added 12 commits August 22, 2025 14:55

Green test_reconcile_data_with_mismatches_and_missing_special_chars

7c8b6cf

Fix threshold query with green specs

a16c9a6

formatting

9c252e4

introduce fixture file for query testing fixtures

1963ab4

fix green spec of test_hash_query

3acf61f

fix rest of specs hash query

8c22b4a

remove integration test for now

d37c02a

formatting

2b5bc01

fix sampling query

c372166

Fix execute tests

9b38c7c

Fix fallback of missing transform

7bcf118

formatting

5873077

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from 8e13d83 to afa6f46 Compare August 22, 2025 12:55

m-abulazm temporarily deployed to tool August 22, 2025 12:55 — with GitHub Actions Inactive

m-abulazm changed the base branch from main to feat/recon/add-flag-for-normalization August 22, 2025 12:56

m-abulazm added 2 commits August 22, 2025 16:03

DRY duplicated code

28f3e33

small refactoring in hash_query.py

a53985e

m-abulazm force-pushed the bug/recon/1866-fix-recon-queries branch from afa6f46 to a53985e Compare August 22, 2025 14:03

m-abulazm temporarily deployed to tool August 22, 2025 14:03 — with GitHub Actions Inactive

sundarshankar89 approved these changes Aug 26, 2025


sundarshankar89 requested a review from a team August 26, 2025 04:27

gueniai approved these changes Aug 29, 2025


bishwajit-db approved these changes Sep 1, 2025


m-abulazm merged commit 3322387 into feat/recon/add-flag-for-normalization Sep 1, 2025
9 checks passed

m-abulazm deleted the bug/recon/1866-fix-recon-queries branch September 1, 2025 08:22

m-abulazm restored the bug/recon/1866-fix-recon-queries branch September 1, 2025 08:41

m-abulazm added a commit that referenced this pull request Sep 1, 2025

Revert "Extend recon queries to use normalized column names (#1923)"

d2113f8

This reverts commit 3322387.

m-abulazm deleted the bug/recon/1866-fix-recon-queries branch September 1, 2025 08:41
