Add materialized views RFC #44
Conversation
Force-pushed from b3e4930 to 26b27c5
@tdcmeehan Thanks for this RFC. Meta is also planning to resume the work on materialized views. Back in the day, I initially tried to build this in the planning phase; however, that led to really complex and messy structures, which were very difficult to work with. We probably need an even higher-level abstraction. Let's align with @arhimondr, @kaikalur and @zation99 on this.
Two issues
so we need to do some cost-benefit analysis. Best might be to keep it narrowly scoped, like a "preagg a single table" view for aggregation-only queries to start with. Once that stabilizes, we can extend it to other shapes.
Thanks for the comments @jainxrohit and @kaikalur. @kaikalur, to be clear, this design allows for your recommendation of progressive refinement. The design's default behavior is deliberately simple, exactly as you recommended: single-table materialized views with simple stale/fresh checks. Progressive enhancements, such as fine-grained refresh or unioning up-to-date partitions with more up-to-date data, are optional and can be implemented by connectors as needed. @jainxrohit, thanks for the background. I believe this design abstracts the complexity and allows for progressive enhancement. I'm keen to maintain momentum on the implementation. Please let me know if you need any additional information or clarification to speed up the review.
I think the main detail I need to understand better is the proposal to move materialized view optimization from the analysis phase to the planning phase. I believe doing this in the planning phase, especially for the use cases Meta wants to support, is really complex. Should we set up some time to discuss the details further?
@jainxrohit I'll be happy to brainstorm how this might work for your internal infrastructure.
* Backward compatibility with existing Hive MV implementation (it's undocumented and unused)
* Supporting materialized views across different connectors (cross-connector MVs)
* Automatic view selection/routing for arbitrary queries
While query rewriting is mentioned as a non-goal, using planning-phase plans for MV matching / query rewriting would be far better than the AST-based rewriter we have now.
This is how Calcite (and therefore Hive) does MV rewrites as well; see https://calcite.apache.org/docs/materialized_views.html.
Could you please clarify what makes plan-phase MV matching/rewriting far better? Although Calcite implements it in planning, it still uses rule-based matching, and it seems nothing specific to the planning phase is utilized.
In Calcite they do have mechanisms to compare the cost of plans to select the best mview.
Any cost-based decision would need to be done in the planner, and there are numerous options: cost-based refresh strategy (incremental vs. full), cost-based stitching determination, and cost-based view replacement. We don't need to implement any of this now, but keeping this rewriting in the analysis phase makes starting those projects harder, because they become not only about implementing the cost-based decisions, but also about rewriting the stitching and view replacement code. We want excellent and robust materialized view support in Presto, which means intelligent decisions on when to materialize and which views to select, which means moving to the planner is a worthwhile endeavor.
The general idea of Calcite, and the paper it builds upon, is to have the candidates built with rule-based matching, which can be done either in analysis or planning, and then have a cost-based analysis in planning to finally choose one or apply further optimizations. In Presto, though, the rule-based matching to generate candidates is easier to do in analysis.
> In Calcite they do have mechanisms to compare the cost of plans to select the best mview.

That's right. Comparing can be done with cost-based estimation in planning. But the candidates to compare are easier to generate in analysis.
> In Presto, though, the rule-based matching to generate candidates is easier to do in analysis.

In Presto, all the rule-based plan matching happens in the planner (i.e., in the PlanOptimizer rules), not in the analysis phase.
Can you clarify, with an example, what you mean by candidate generation and the rules to generate these candidates?
@tdcmeehan: I did a read-through of this RFC and it's a heavy read. I have some early comments, but will likely have more as I absorb this material.
Are you planning to do a talk-through of this RFC in one of the working groups? It might be good to get consensus.
LGTM, thanks
### 4. Query Rewriting
- Simple pattern matching to detect if a query can use an MV
- Limited to exact matches or simple projections
- No cost-based selection between multiple MVs
I think this is also a limitation of the current design, since the analyzer is still doing the replacement.
I guess we can in the future do something like an MviewReference (analogous to CTEReference --> LogicalCte), or a drop-it approach, i.e. the analyzer creates candidates and the planner chooses among them somehow. That could also help with fallback.
You are right that this is also a limitation of this design, but the current solution makes it impossible to add cost-based selection in the future, whereas this design allows for that future enhancement, perhaps along the lines of what you mention.
But the problem remains that when we do "view replacement", we would need to create some canonicalized AST to replace.
I believe the new design will handle this better, or at least I can see a high-level path for how we can get that to work. Connectors return a list of possible replacement candidates via the existing `ConnectorMetadata#getReferencedMaterializedViews` for a given table during analysis. We put this in the `MaterializedViewNode` AST and plan each of the candidates. At that point I believe we have all of the tools necessary to leverage some of the techniques that have been written about, for example the technique Calcite uses.
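As a rough illustration of that candidate flow, here is a minimal sketch. Everything below (`PlanNode`, `CandidatePlan`, `MaterializedViewSelector`, the cost values) is a hypothetical simplification for illustration, not Presto's actual planner classes; only `ConnectorMetadata#getReferencedMaterializedViews` is real.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified types for illustration; not Presto's actual classes.
interface PlanNode {}

// One planned alternative per MV candidate returned by
// ConnectorMetadata#getReferencedMaterializedViews during analysis.
record CandidatePlan(String materializedViewName, PlanNode root, double estimatedCost) {}

final class MaterializedViewSelector
{
    // Cost-based selection: plan every candidate, keep the cheapest plan,
    // and fall back to the original plan when no rewrite beats it.
    static PlanNode selectBest(PlanNode originalPlan, double originalCost, List<CandidatePlan> candidates)
    {
        Optional<CandidatePlan> cheapest = candidates.stream()
                .min(Comparator.comparingDouble(CandidatePlan::estimatedCost));
        return cheapest
                .filter(candidate -> candidate.estimatedCost() < originalCost)
                .map(CandidatePlan::root)
                .orElse(originalPlan);
    }
}
```

Because all candidates are fully planned before the comparison, a cost model can score each alternative the same way it scores any other plan, which is the property the Calcite-style approach relies on.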
Force-pushed from bcdc3b6 to b5f163e
@jainxrohit @zation99 @arhimondr @aditi-pandit @jaystarshot @hantangwangd @aaneja Thanks for all your previous reviews and discussions. One of the core assumptions in my earlier design was that the existing materialized view framework was unused, so I had creative liberty to redesign it without significant impact to existing users. With the revelation that Meta will soon enter production with the existing framework, that assumption no longer holds, and I've updated the design to emphasize continuity and provide a reasonable migration path. Sharing stitching and refresh logic between the open source Iceberg connector and Meta's vast internal infrastructure will be a huge advantage to the open source project. The RFC now primarily proposes:
1. Refactoring the stitching to rely on predicates provided by the connector outside of the table handle. (Table-handle-provided staleness still works fine, but open source connectors need to make expensive metadata calls to determine staleness, and this works better as a separate call outside of analysis.)
2. A new type of predicate that supports time-travel-based stitching to the exclusion of partition-based stitching (optional, only if the connector supports that feature).
3. Support for automatic REFRESH (using the same predicates used for stitching).
Please take a look and let me know if you have any feedback.
Force-pushed from 19f370a to 15d316b
The existing materialized views implementation is located primarily in the presto-hive module and follows this flow:

### 1. Metadata Storage
- MVs are stored as Hive tables with special properties
Nit: this metadata part is unchanged in the RFC. Maybe just say that explicitly.
#### Query-Time Analysis (SELECT from MV)
When querying a materialized view, the system:
- Recognizes the table as an MV during name resolution
- Calculates staleness by checking partition modification times
How is this achieved? The analysis doesn't really know about the partition key unless it is added as part of the metadata. Also, how do we obtain the partition modification time? Can we call out to the connector from analysis itself?
It's piped through the `predicateDisjuncts`, which are just `TupleDomain`s on the `MaterializedViewStatus` object, which is retrieved during analysis. This check is probably reasonably lightweight for Hive, but for Iceberg it will be more sophisticated and more complicated, and also more resource-intensive, since it involves reading the Iceberg metadata from the filesystem. In my proposal, the call to retrieve the `MaterializedViewStatus` happens later, after queuing, but it involves similar predicates.
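For concreteness, a small hypothetical example of what one such disjunct could look like. The `Domain`/`TupleDomain` API is Presto's (from the presto-common module); the `ds` partition column and the date values are invented for illustration.

```java
import com.facebook.presto.common.predicate.Domain;
import com.facebook.presto.common.predicate.TupleDomain;
import io.airlift.slice.Slices;

import java.util.List;
import java.util.Map;

import static com.facebook.presto.common.type.VarcharType.VARCHAR;

final class FreshnessPredicateExample
{
    // Hypothetical: the MV's storage table is partitioned on "ds", and the
    // connector determined that only these two partitions are up to date.
    // A disjunct like this lets the planner filter the storage-table scan
    // down to fresh partitions and bound the recomputation to the rest.
    static TupleDomain<String> freshPartitions()
    {
        Domain freshDs = Domain.multipleValues(
                VARCHAR,
                List.of(Slices.utf8Slice("2024-01-01"), Slices.utf8Slice("2024-01-02")));
        return TupleDomain.withColumnDomains(Map.of("ds", freshDs));
    }
}
```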
RFC-0013-materialized-views.md (Outdated)
New AST Node: MaterializedViewScan
- Left side: Data table scan (storage table)
Why is this left side and right side? It makes it sound like a join.
I have this as left and right because it's a tree in the AST. The left is just a leaf node representing the base table scan; the right is a subtree representing the definition.
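For illustration, here is a hypothetical shape for that node. Everything below is a simplified sketch; the real node would extend Presto's `com.facebook.presto.sql.tree.Node` hierarchy and carry the MV metadata.

```java
import java.util.List;

// Hypothetical, simplified AST types for illustration only.
interface AstNode
{
    List<AstNode> children();
}

// Left child: a leaf scanning the MV's storage (data) table.
record StorageTableScan(String storageTableName) implements AstNode
{
    @Override
    public List<AstNode> children()
    {
        return List.of();
    }
}

// Right child: the subtree representing the analyzed view definition.
record ViewQuery(AstNode queryRoot) implements AstNode
{
    @Override
    public List<AstNode> children()
    {
        return List.of(queryRoot);
    }
}

// The proposed node: a tree whose left side is the storage-table scan and
// whose right side is the view definition, plus MV metadata for planning.
record MaterializedViewScan(StorageTableScan left, ViewQuery right, String mvName) implements AstNode
{
    @Override
    public List<AstNode> children()
    {
        return List.of(left, right);
    }
}
```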
I made the language more explicit.
RFC-0013-materialized-views.md (Outdated)
New AST Node: MaterializedViewScan
- Left side: Data table scan (storage table)
- Right side: View query structure
- MV metadata for planning decisions
Do you know some of the fields for this already?
It will be the `MaterializedViewDefinition`, updated.
- MV metadata and symbol mappings for optimization
- Standard `PlanNode` that integrates with existing infrastructure

Optimization Phase: MaterializedViewOptimizer Rule (Default Implementation)
Where is this rule positioned in the planner's rule application? Closer to the end of the rules list? If there are predicates, etc., we want them to be pushed down as much as possible before choosing an MV.
I picture this being early during optimization, so that all of our usual predicate pushdown rules apply to the stitched query.
- Fresh MV: returns data table scan directly
- Partially stale: builds UNION of filtered fresh data + recomputed stale data

The planner constructs stitching plans by adding standard `UnionNode`, `TableScanNode`, and `FilterNode` nodes, then relies on existing planner rules for optimization. This design uses only the existing `MaterializedViewStatus` SPI method. Connectors only need to detect staleness and provide constraint information; they don't need to understand or implement stitching logic themselves. Filter nodes are pushed down by later planner rules through existing constraint pushdown mechanisms.
This is non-intuitive to me... Why are filter nodes pushed down later? Materialized views are only table scans (by nature it's like a physical optimization that chooses an index of a table, if available).
The point I'm making is that the job of the `MaterializedViewOptimizer` is to be as simple as possible, as it's already a little complicated: just create the union with the appropriate predicates. It will not also be its job to ensure those predicates are optimized correctly; that will be handled by our general optimizer rules.
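A sketch of that "just create the union with the appropriate predicates" step. The plan-node types below are hypothetical stand-ins (the real rule would construct Presto's `UnionNode`, `TableScanNode`, and `FilterNode`), and the string-based predicate handling is deliberately naive.

```java
// Hypothetical, simplified plan nodes; the real rule builds Presto's
// UnionNode / TableScanNode / FilterNode.
interface Plan {}
record TableScan(String table) implements Plan {}
record Filter(Plan source, String predicate) implements Plan {}
record Union(Plan left, Plan right) implements Plan {}

final class StitchingSketch
{
    // freshPredicate comes from the connector's MaterializedViewStatus;
    // viewDefinitionPlan is the planned view query over the base tables.
    static Plan stitch(boolean fullyFresh, String storageTable, String freshPredicate, Plan viewDefinitionPlan)
    {
        TableScan storageScan = new TableScan(storageTable);
        if (fullyFresh) {
            // Fresh MV: return the storage-table scan directly.
            return storageScan;
        }
        // Partially stale: fresh rows come from the storage table, stale rows
        // are recomputed from the base tables. Later optimizer rules push
        // both filters down through existing constraint pushdown mechanisms.
        Plan freshSide = new Filter(storageScan, freshPredicate);
        Plan staleSide = new Filter(viewDefinitionPlan, "NOT (" + freshPredicate + ")");
        return new Union(freshSide, staleSide);
    }
}
```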
Force-pushed from dbc156b to 9cb3aea
@tdcmeehan thanks for the adjustments. I've just left some questions for discussion, mainly about the Iceberg implementation.
Thanks for the fix, LGTM!
Asking for some clarifications on the RFC refresh:
- In Analysis, what does "store MV definition"/"store WHERE clause" mean? Where are we going to store them?
- Why can't the first two steps of planning be done in analysis, where constructing the insert is easier?
- We will store them in the Analysis object, as we do currently. What I mean by "store WHERE clause" is: generate the predicates for the WHERE clause but don't apply them to the query, because depending on the connector you might not be able to do that (we might want to throw if one is supplied). We would store the WHERE clause as a standalone Expression, which we would then turn into a filter in the planner, which is an existing pattern.
- Because we need to apply the predicates depending on the shape of the query (whether or not aggregations are present, among other criteria), it seems more straightforward to apply this during planning. I'm also worried about the metadata overhead for Iceberg, where we'll potentially have to scan many files, and about the unbounded nature of the work we do in the analysis phase. In all cases, analysis just accesses the catalog, which is usually a single RPC call; more expensive metadata calls happen later. By moving it to planning we get a natural cap on concurrency, since that is post-queueing (effectively lazy rather than eager).

Regarding the complexity of it, I think the most complicated part is determining when to apply the predicates based on the shape of the query, which I don't see being much easier in analysis. Do you picture applying the predicates to be very complicated? I picture it being about as complicated as applying row filters and column masks, which we do in the planner now.
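A minimal sketch of that analysis/planning split, with hypothetical types throughout (`Expression` here is just a string stand-in for Presto's AST expression class, and `RefreshAnalysis` stands in for the Analysis object).

```java
// Hypothetical sketch of the analysis/planning split described above.
record Expression(String sql) {} // stand-in for Presto's AST expression type

// Analysis: derive the refresh WHERE predicate and record it on the
// Analysis object, without rewriting the query yet.
final class RefreshAnalysis
{
    private Expression refreshPredicate; // null means full refresh

    void storeRefreshPredicate(Expression predicate)
    {
        this.refreshPredicate = predicate;
    }

    Expression refreshPredicate()
    {
        return refreshPredicate;
    }
}

// Planning: turn the stored predicate into a filter, but only where the
// shape of the query allows it, mirroring how row filters and column masks
// are applied in the planner today.
final class RefreshPlanner
{
    static String planRefreshScan(String baseTableScanSql, RefreshAnalysis analysis)
    {
        Expression predicate = analysis.refreshPredicate();
        if (predicate == null) {
            return baseTableScanSql; // no staleness info: do a full refresh
        }
        return baseTableScanSql + " WHERE " + predicate.sql();
    }
}
```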
Shall we include the connectors in the graph as well?
That would be `getMaterializedViewStatus()`.