Skip to content

Conversation

@szehon-ho
Copy link
Member

@szehon-ho szehon-ho commented Nov 26, 2025

What changes were proposed in this pull request?

#52225 allow MERGE INTO to support case where assignment value is a struct with less fields than the assignment key, ie UPDATE SET big_struct = source.small_struct.

This makes this feature off by default, and turned on via a config.

Why are the changes needed?

The change brought some interesting question, for example there is some ambiguity in user intent. Does the UPDATE SET * mean set all nested fields or top level columns? In the first case, missing fields are kept. In the second case, missing fields are nullified.

I tried to make a choice in #53149 but after some feedback, it may be a bit controversial, choosing one interpretation over another. A SQLConf may not be the right choice, and instead we may need to introduce some new syntax, which require more discussion.

Does this PR introduce any user-facing change?

No this feature is unreleased

How was this patch tested?

Existing unit test

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 26, 2025
def coerceMergeNestedTypes: Boolean =
getConf(SQLConf.MERGE_INTO_NESTED_TYPE_COERCION_ENABLED)
// Disable until we define the semantics of UPDATE SET * with nested types
def coerceMergeNestedTypes: Boolean = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we just turn off the config by default without removing tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that'd be great

@corleyma
Copy link

Is it possible to make this a behavior that folks can opt into while the semantics are being sorted out? I would find the new behavior very useful, and it's a bit sad to leave it disabled when it's implemented. The gymnastics I do to handle this today isn't fun.

@szehon-ho
Copy link
Member Author

szehon-ho commented Nov 26, 2025

sure, as per @cloud-fan 's comment we will disable by config. I will mark config as experimental and know the semantics for nested field assignment may change in future release, if they are not matching in schema

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @szehon-ho . Looking forward to seeing the final status of this PR.

BTW, when you make a PR for master branch, it affects Apache Spark 4.2.0 too. So, please remove this wording from this PR.

This change disable it for Spark 4.1.

@szehon-ho
Copy link
Member Author

Updated pr description and pr.

Because now I have a config to enable it, I reverted the more controversial pr #53149 manually, it should revert to the original simpler behavior of replacing structs at column level for UPDATE SET *.

Thanks!

|s STRUCT<c1: INT, c2: STRUCT<a: ARRAY<INT>, m: MAP<STRING, STRING>>>,
|dep STRING""".stripMargin,
"""{ "pk": 1, "s": { "c1": 2, "c2": { "a": [1,2], "m": { "a": "b" } } }, "dep": "hr" }""")
Seq(true, false).foreach { coerceNestedTypes =>
Copy link
Member Author

@szehon-ho szehon-ho Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: most of these changes is because coerceNestedTypes was true by default, now its false, so i add another dimension to these tests

@dongjoon-hyun dongjoon-hyun self-assigned this Nov 26, 2025
@szehon-ho szehon-ho changed the title [SPARK-54525][SQL] Disable nested struct coercion in MERGE INTO [SPARK-54525][SQL] Disable nested struct coercion in MERGE INTO under a config Nov 26, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the direction. Let's see the CI result first. Thanks.

@cloud-fan
Copy link
Contributor

cloud-fan commented Nov 27, 2025

to share some initial thoughts:

  • The behavior of SET target.struct_col = source.struct_col is quite intuitive: assignment means we fully assign the right side value to the left side variable. After schema evolution or coercion, these two structs should have the same schema, and we simply write the source struct to the table, no matter it's null or it has null fields.
  • If we don't want this "full assignment" behavior, but retain the original values of some target struct fields that are not present in the source struct, we can follow SET * and support this syntax: SET target.struct_col.* = source.struct_col.*.
  • Let's say if the struct is deeply nested, and we want to control how deep we expand the fields during assignment, we can make the syntax more flexible. For example, SET target.struct_col.*.* = source.struct_col.*.* means expand two levels, .** or .*..* means expands all levels until hit the leaf field.
  • Table row is similar to a struct, so the syntax should be consistent. e.g. UPDATE SET * only expand one level, which means assign top-level columns from source to target. UPDATE SET *.* means two levels and UPDATE SET ** means all levels. For full assignment, it's like UPDATE SET target_table = source_table, but we likely don't want it as it's very weird.
  • In reality, we likely don't need to be so flexible, having * and ** should be good enough for expanding one level and all levels.

@szehon-ho let's keep the existing code if they may still be useful to implement the new syntaxes.

@dongjoon-hyun
Copy link
Member

Could you address the above comment, @szehon-ho ?

@dongjoon-hyun
Copy link
Member

To @szehon-ho and @cloud-fan , let me merge this AS-IS status to unblock Apache Spark 4.1.0.

For the last comment, we can add back cleanly to master branch for Spark 4.2.0 later.

let's keep the existing code if they may still be useful to implement the new syntaxes.

Thank you.

dongjoon-hyun pushed a commit that referenced this pull request Nov 29, 2025
… a config

### What changes were proposed in this pull request?
#52225 allow MERGE INTO to support case where assignment value is a struct with less fields than the assignment key, ie UPDATE SET big_struct = source.small_struct.

This makes this feature off by default, and turned on via a config.

### Why are the changes needed?

The change brought some interesting question, for example there is some ambiguity in user intent.  Does the UPDATE SET * mean set all nested fields or top level columns?  In the first case, missing fields are kept.  In the second case, missing fields are nullified.

I tried to make a choice in #53149 but after some feedback, it may be a bit controversial, choosing one interpretation over another.  A SQLConf may not be the right choice, and instead we may need to introduce some new syntax, which require more discussion.

### Does this PR introduce _any_ user-facing change?
No this feature is unreleased

### How was this patch tested?
Existing unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #53229 from szehon-ho/disable_merge_update_source_coercion.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 23d9253)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun dongjoon-hyun removed their assignment Nov 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants