add support for `DISTINCT ON` #1620

lschneiderbauer · 2025-10-01T18:52:12Z

This PR is a first draft to address duckdb/duckdb-r#384. It adds support for "DISTINCT ON" when using distinct(..., .keep_all = TRUE). DISTINCT ON is a SQL variant supported by PostgreSQL and DuckDB; see e.g. https://duckdb.org/docs/stable/sql/query_syntax/select.html#distinct-on-clause. Currently, .keep_all = TRUE is implemented using window functions. Using DISTINCT ON instead promises a performance boost for that operation.

I came to the conclusion that this cannot be addressed within an external dbplyr backend; it requires a minor modification of the lazy_select_query data structure itself:
Currently, a lazy_select_query supports a distinct state, which can be either TRUE or FALSE (corresponding to a normal SELECT vs. a SELECT DISTINCT).
The basic idea of this PR is to add a third state to the distinct attribute, which represents a list of columns that belong to the SELECT DISTINCT ON (...) clause.
dbplyr backends can opt in to DISTINCT ON by implementing a method for the new generic supports_distinct_on() that returns TRUE.
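
The opt-in mechanism described above could look roughly like the following sketch. Only the generic name supports_distinct_on() comes from this PR; the argument name `con`, the default method, and the connection class `my_backend_connection` are assumptions made for illustration and are not taken from the actual dbplyr code.

```r
# Generic that a backend can implement to opt in to DISTINCT ON.
# The argument name `con` is an assumption.
supports_distinct_on <- function(con, ...) {
  UseMethod("supports_distinct_on")
}

# Default: a backend does not use DISTINCT ON unless it opts in.
supports_distinct_on.default <- function(con, ...) {
  FALSE
}

# Hypothetical backend connection class that opts in.
supports_distinct_on.my_backend_connection <- function(con, ...) {
  TRUE
}

# Stand-in connection objects, for demonstration only.
con_plain <- structure(list(), class = "some_other_connection")
con_opt_in <- structure(list(), class = "my_backend_connection")

supports_distinct_on(con_plain)
supports_distinct_on(con_opt_in)
```

With a default method that returns FALSE, existing backends would keep the window-function translation unless they explicitly define a method.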

Open issues

A major issue is the handling of the order specification. DISTINCT ON uses ORDER BY to specify an ordering. That is also a reason why ORDER BY is allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids ORDER BY in subqueries. I have not yet investigated whether that can be modified easily, or changed at all. In any case, from the user's perspective, window_order() would probably still be the right verb to specify the order.
The syntax highlighting of the currently generated SQL code is incorrect, probably because sql_clause() is not used correctly.
Having one distinct attribute that holds either TRUE, FALSE, or a column list leads to a lot of case distinctions in the code, which are rather unpleasant to read and complicate the code. There is probably a way to model this in a more streamlined fashion.
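
To make the ordering issue concrete, here is a local base R analogy (an illustration only, not dbplyr code). Keeping the first row per key gives different results depending on how the rows are sorted beforehand, which is the role ORDER BY plays for DISTINCT ON. The data frame and the helper `first_per_key()` are made up for this sketch.

```r
# Toy data: two rows per key `g`.
df <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(2, 1, 4, 3)
)

# Hypothetical helper: sort by `key` and `order_col`, then keep the first
# row for each value of `key`. This loosely mirrors
# SELECT DISTINCT ON (key) ... ORDER BY key, order_col.
first_per_key <- function(data, key, order_col) {
  sorted <- data[order(data[[key]], data[[order_col]]), , drop = FALSE]
  sorted[!duplicated(sorted[[key]]), , drop = FALSE]
}

# Without an explicit order, the "first" row is whichever happens to come first.
unordered <- df[!duplicated(df$g), , drop = FALSE]

# With an explicit order, the row with the smallest `x` per key is kept.
ordered <- first_per_key(df, "g", "x")

unordered$x
ordered$x
```

The two results differ for the same keys, which is why the translation needs some way to carry the user's ordering into the generated query.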

I would appreciate feedback on the open issues as well as on the existing code. It might not be the right approach after all.

hadley · 2025-11-22T07:20:53Z

This seems like a reasonable approach to me. Are you still interested in working on this? If so, I can give you a detailed review. If not, I'm happy to finish it off.

lschneiderbauer · 2025-11-22T17:27:15Z

> This seems like a reasonable approach to me. Are you still interested in working on this? If so, I can give you a detailed review. If not, I'm happy to finish it off.

Both options are fine with me. Whatever is more convenient to you.

hadley · 2025-11-25T15:27:25Z

Are you interested in doing more PRs in the future? If so, I'm happy to invest the time in review 😀

lschneiderbauer · 2025-11-25T19:30:31Z

In that case I would be happy about a review. :)

hadley

Partial review; I'll complete after rebasing so it's styled with air.

hadley · 2025-11-27T06:37:38Z

R/lazy-select-query.R

  stopifnot(is_lazy_sql_part(order_by))
  check_number_whole(limit, allow_infinite = TRUE, allow_null = TRUE)
-  check_bool(distinct)
+  # distinct = FALSE     -> no distinct


This is a great comment that helps me understand the intent of the rest of the code 😄

But it does illustrate a problem with lazy_select_query(): it isn't properly documented, so it's hard to tell what the types of the inputs are. You don't need to fix it here, but it might be a useful follow-up PR.

hadley · 2025-11-27T06:39:17Z

R/lazy-select-query.R

  } else {
    select <- new_lazy_select(select)
+    if (!is.logical(distinct)) {
+      distinct <- new_lazy_select(distinct)


Why do this here and not in sql_build.lazy_select_query?

I moved it to sql_build.lazy_select_query() now. Originally I did this here because it seemed more consistent with the select variable.

hadley · 2025-11-27T06:40:27Z

R/lazy-select-query.R

 #' @export
 print.lazy_select_query <- function(x, ...) {
-  cat_line("<SQL SELECT", if (x$distinct) " DISTINCT", ">")
+  cat_line("<SQL SELECT", if (!isFALSE(x$distinct)) " DISTINCT",


Do you want to show the variables here?

I added a new line that prints the DISTINCT ON variables.

R/query-select.R

hadley · 2025-11-27T06:49:24Z

R/sql-clause.R


+  sql_distinct <-
+    if (isTRUE(distinct)) {
+      " DISTINCT"


Our usual style would be to put sql_distinct <- in each branch or extract out a helper function. I'd probably lean towards a helper function since you could inline that in clause below.

Like I have done now? Or should the helper function be inside sql_clause_select()?
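
One possible shape for such a helper, sketched in base R under stated assumptions: the name `sql_distinct_clause()` is hypothetical, the column list is represented as a plain character vector, and identifiers are quoted naively with double quotes rather than through dbplyr's escaping functions. It is not the code from this PR.

```r
# Hypothetical helper: turn the three-state `distinct` value into the
# text that follows SELECT.
#   FALSE            -> ""
#   TRUE             -> " DISTINCT"
#   character vector -> " DISTINCT ON (<cols>)"
sql_distinct_clause <- function(distinct) {
  if (isFALSE(distinct)) {
    return("")
  }
  if (isTRUE(distinct)) {
    return(" DISTINCT")
  }
  stopifnot(is.character(distinct), length(distinct) > 0)
  # Naive identifier quoting, for illustration only.
  cols <- paste0('"', distinct, '"', collapse = ", ")
  paste0(" DISTINCT ON (", cols, ")")
}

paste0("SELECT", sql_distinct_clause(FALSE))
paste0("SELECT", sql_distinct_clause(TRUE))
paste0("SELECT", sql_distinct_clause(c("g", "h")))
```

Keeping the three cases in one function means callers can paste the result into the SELECT clause without repeating the case distinction.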

hadley

The code seems solid to me overall — just lots of small style questions. Thanks again for working on this!

hadley · 2025-11-27T06:51:56Z

R/verb-distinct.R

+      if (is_identity(syms(distinct), names(distinct), colnames(.data))) {
+        TRUE
+      } else {
+        syms(distinct)


Do you think that's worth checking? Seems like it would be pretty rare.

Yeah, you are right. I removed the extra check.

lschneiderbauer · 2025-12-01T21:01:56Z

> The code seems solid to me overall, just lots of small style questions. Thanks again for working on this!

Thanks for the review!

The ordering issue is still open (from the MR description):

> A major issue is the handling of the order specification. DISTINCT ON uses ORDER BY to specify an ordering. That is also a reason why ORDER BY is allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids ORDER BY in subqueries. I have not yet investigated whether that can be modified easily, or changed at all. In any case, from the user's perspective, window_order() would probably still be the right verb to specify the order.

Do you have thoughts on that?

lschneiderbauer added 7 commits September 30, 2025 20:08

optionally use DISTINCT ON SQL clause when keep_all = TRUE in `di…

04d2822

…stinct()`

simplify and fix whitespace

09f6aa1

some temporary fixes

4fdaa08

we explicitely need to select columns after using ON()

e7db82b

make distinct computations and renaming work as usual

7713bca

add some postgres tests

9b75eb0

fix new tests

b2b7146

lschneiderbauer mentioned this pull request Oct 1, 2025

Implement DISTINCT ON for dbplyr backend duckdb/duckdb-r#384

Open

hadley reviewed Nov 27, 2025

hadley added 2 commits November 27, 2025 08:45

Merge commit '0c35ea1a8eb9acb5093f4549365ac329b4646a26'

f1a235d

Reformat with air

d6c04a1

hadley reviewed Nov 27, 2025

hadley reviewed Nov 27, 2025

lschneiderbauer added 2 commits December 1, 2025 21:43

simplifications (review response)

2f0d8da

print DISTINCT ON variables

8b31696
