Skip to content

Conversation

@FBruzzesi
Copy link
Member

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

We were a bit far from polars behavior for vertical concat. This PR aims to better align with it.

Polars (and some other backends) do not allow to concatenate with different dtypes. The casting has to happen explicitly rather than implicitly. Hence we now check for schema equality rather than column names only.

The error raised tries to match the (lazy) polars' one.

The schemas of the various frames are evaluated lazily for performance

return str(Path(source))


def validate_concat_vertical_schemas(schemas: Iterable[Mapping[str, DType]]) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By passing an iterable, we can avoid to immediately evaluate all the schemas

msg = (
"'union'/'concat' inputs should all have the same schema,got\n"
f"{_pretty_format_schema(schema_0, index=0)}\n"
" and\n"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The space here is not a typo, that's what polars has πŸ˜‚

@FBruzzesi
Copy link
Member Author

QQ: Should we consider pushing the validation at the narwhals level to have it fully standardized? πŸ€”

@dangotbanned
Copy link
Member

dangotbanned commented Oct 9, 2025

Polars (and some other backends) do not allow to concatenate with different dtypes. The casting has to happen explicitly rather than implicitly. Hence we now check for schema equality rather than column names only.

Could we implement the _relaxed variants first? (After finding which backends support it)

https://docs.pola.rs/api/python/stable/reference/api/polars.concat.html

I'm pretty sure that at least pyarrow.concat_tables would work via promote_options="permissive"

IMO, it seems like this change could break things downstream - so having an alternative to suggest may make the transition smoother 🀞

@FBruzzesi
Copy link
Member Author

FBruzzesi commented Oct 9, 2025

Currently (at latest dependencies versions) polars, pyarrow and ibis would raise natively with schema mismatch. Notice that for duckdb and ibis we also already have a schema check, which we don't have for pyspark and dask.

Table view:

Backend Narwhals behavior Native behavior on schema mismatch
polars Nothing It raises two different exceptions (eager vs lazy)
pandas-like Raises on columns mismatch Relaxed
pyarrow Raises on columns mismatch Raises
dask Raises on columns mismatch Relaxed
duckdb Raises on schema mismatch Relaxed
ibis Raises on schema mismatch Raises
spark-like Raises on columns mismatch Relaxed

Could we implement the _relaxed variants first? (After finding which backends support it)

Oh boy, that's definitely a way bigger task than anticipated, and I am not really interested in it to be fair πŸ˜‚

IMO, it seems like this change could break things downstream

  • For the eager case: polars and pyarrow would already raise on schema mismatch, pandas is the only relaxed behavior (which is definitely too relaxed in some cases).
  • For the lazy case: the behavior is mixed and we do not even match the native one.

I understand this is a risk, but to me it seems uncrackable/unreliable from a user perspective. Arguably we could maintain the current behavior for stable.v1, maybe for stable.v2, but for sure I would want the changes in the main namespace (which has a few less promises on stability).

so having an alternative to suggest may make the transition smoother 🀞

Sure, one alternative is to suggest to cast the dtypes before calling nw.concat(..., how="vertical"). Now that I think about it, also for the how="diagonal" case we should check common columns dtypes

@dangotbanned
Copy link
Member

dangotbanned commented Oct 10, 2025

_relaxed variants

Note

I'm about 90% sure I want to split this into another issue
just letting the idea flow out a bit πŸ˜‰

Could we implement the _relaxed variants first? (After finding which backends support it)

Oh boy, that's definitely a way bigger task than anticipated, and I am not really interested in it to be fair πŸ˜‚

That's fair!

I was thinking if a backend only had a strict UNION, then we'd just need to implement a version of polars' supertyping rules on our DTypes:

Or I guess the DuckDB rules have a nicer visual to digest πŸ˜„

Show Casting Operations Matrix

typecasting-matrix

A nice thing about that would be we can ensure consistency by performing the same casts for every backend.
So maybe the implementation would just live in narwhals.functions?
E.g. "vertical_relaxed", just applies the supertyping rules at the narwhals-level, then calls concat(..., how="vertical")

@dangotbanned
Copy link
Member

so having an alternative to suggest may make the transition smoother 🀞

Sure, one alternative is to suggest to cast the dtypes before calling nw.concat(..., how="vertical").

It would be quite natural to do this with DataFrame.cast and selectors.
I'd love to revive an old PR of yours for this πŸ™‚

I found myself doing something pretty similar the other day, but working directly on pyarrow.{DataType,Schema}

def cast_schema(
native: pa.Schema, target_types: DataType | Mapping[str, DataType] | DataTypeRemap
) -> pa.Schema:
if isinstance(target_types, pa.DataType):
return pa.schema((name, target_types) for name in native.names)
if _is_into_pyarrow_schema(target_types):
new_schema = native
for name, dtype in target_types.items():
index = native.get_field_index(name)
new_schema.set(index, native.field(index).with_type(dtype))
return new_schema
return pa.schema((fld.name, target_types.get(fld.type, fld.type)) for fld in native)
def cast_table(
native: pa.Table, target: DataType | IntoArrowSchema | DataTypeRemap
) -> pa.Table:
s = target if isinstance(target, pa.Schema) else cast_schema(native.schema, target)
return native.cast(s)

@FBruzzesi
Copy link
Member Author

<rant>How many different error messages did polars change?<\rant>

@FBruzzesi FBruzzesi changed the title chore: Align vertical concat behavior chore: Align nw.concat(..., how="vertical") behavior across backend for schema mismatch Oct 10, 2025
@MarcoGorelli
Copy link
Member

thanks @FBruzzesi !

i'm not totally sure, my original hesitation for being as strict as polars was that in other backends it's easier to accidentally change the datatype (e.g. if a missing value gets inserted), although it is encouraging that the downstream tests pass πŸ€”

@FBruzzesi
Copy link
Member Author

Hey @MarcoGorelli πŸ‘‹πŸΌ

i'm not totally sure, my original hesitation for being as strict as polars was that in other backends it's easier to accidentally change the datatype (e.g. if a missing value gets inserted)

Doesn't this apply only to pandas non-nullable dtypes?

although it is encouraging that the downstream tests pass πŸ€”

Not sure how used and tested this feature is to be honest.


I will move it back to draft state and make a proposal that is also consistent for how="diagonal", and even see how easy it is to implement the "_relaxed" versions as suggested by @dangotbanned. I had an idea but no time to try it out yet :)

@FBruzzesi FBruzzesi marked this pull request as draft October 13, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants