Skip to content

Conversation

lorenarosati
Copy link
Contributor

@lorenarosati lorenarosati commented Jul 22, 2025

Which issue does this PR close?

Rationale for this change

Virtual tables in Substrait can have either literal values or expressions, and currently, the Substrait consumer only handles the case when values are provided (values is a deprecated field). If there are no values provided, an empty relation is returned, even if the virtual table has expressions specified.

If expressions are provided as elements in a virtual table (rather than literal values), from_read_rel() should return a LogicalPlan with the virtual table elements rather than an empty relation.

What changes are included in this PR?

  • Only returns an empty relation if there are no values and no expressions specified.
  • If expressions is non-empty, we iterate through the records and consume each expression; if expressions is empty, we fallback to iterate through the records and consume each literal.

Are these changes tested?

Yes, a test was added that takes a Substrait plan which has expressions as elements in a virtual table, and checks that the resulting plan is correct (and does not have an empty relation).

Are there any user-facing changes?

These changes will allow users to specify expressions in Substrait virtual tables, and get a result from their query.

@github-actions github-actions bot added the substrait Changes to the substrait crate label Jul 22, 2025
@lorenarosati lorenarosati marked this pull request as draft July 22, 2025 19:16
@lorenarosati lorenarosati marked this pull request as ready for review July 22, 2025 20:28
Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments, but overall this looks good to me. Thanks for fixing this ✨

}
exprs.push(row_exprs);
}
exprs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does introduce a bit of duplication with the block below, but given that consume_expression is async and from_substrait_literal is not it would be a bit tricky to unify the processing. In the long-term Substrait drop support for the values field which is deprecated so we'll be able to remove the lower block entirely.

return Ok(LogicalPlan::EmptyRelation(EmptyRelation {
produce_one_row: false,
schema: DFSchemaRef::new(substrait_schema),
}));
}

let values = vt
let values = if vt.values.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I would suggest checking !vt.expressions.is_empty() to make it clearer that expressions are preferred over values. Effectively, it's the difference between "We are processing expressions because there are no values" as opposed to "We are processing expressions because they are present".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change!

@@ -0,0 +1,94 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I would suggest we update the name to something like virtual_table_with_expressions.substrait.json which more accurately reflects what this plan is intended to verify.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change!

}
},
"input": {
"read": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I would suggest of getting rid of everything other than the read in this plan, to make it clearer that this is a test of virtual table functionality. The aggregation in this plan is effectively noise.

}
}
]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this a slighly more robust test, I would suggest:

  • Using a struct with more than 1 field to verify the name processing is applied correctly.
  • Having more than 1 row to check that multiple rows can be processed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @lorenarosati

I started the CI and gave it a quick skim and it looks good to me (I am deferring most review to @vbarua as he is an expert in this area and familiar with this code)

Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. Thanks for updating this @lorenarosati

@alamb this is ready for merge from my perspective

@alamb
Copy link
Contributor

alamb commented Jul 24, 2025

I will plan to merge tomorrow so there is some additional time for other reviewers if they want

for expression in &row.fields {
name_idx += 1;
let expr = consumer
.consume_expression(expression, &substrait_schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we should be using an empty schema here. The substrait_schema we have access to here is the schema of the ReadRel. The expressions in the VirtualTable cannot reference that schema, or any schema at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change!

@vbarua
Copy link
Contributor

vbarua commented Jul 24, 2025

I will plan to merge tomorrow so there is some additional time for other reviewers if they want

Sounds good. I actually just though of something to update in this PR as well 😅

@lorenarosati lorenarosati requested a review from vbarua July 24, 2025 18:26
Copy link
Contributor

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this @lorenarosati. Now this should really be good to merge 😅

@alamb alamb merged commit 675b96c into apache:main Jul 25, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 25, 2025

Thanks @lorenarosati and @vbarua

adriangb pushed a commit to pydantic/datafusion that referenced this pull request Jul 28, 2025
…che#16857)

* Handle expression and value elements in Substrait VirtualTable

* Added test

* Modified test plan, changed conditional check for clarity

* Consume expressions with empty input schema
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure Substrait consumer can handle expressions in VirtualTable
3 participants