Skip to content

Conversation

@lschneiderbauer
Copy link

This PR is a first draft to address duckdb/duckdb-r#384, and add support for the usage of "DISTINCT ON" when using distinct(..., .keep_all = TRUE) which is a SQL variant supported by PostgreSQL, and DuckDB, see e.g. https://duckdb.org/docs/stable/sql/query_syntax/select.html#distinct-on-clause. Currently, .keep_all = TRUE is implemented using window functions. Using DISTINCT ON instead promises a performance boost of that operation.

I came to the conclusion that this cannot be addressed within an external dbplyr backend, but it requires a minor modification of the lazy_select_query data structure itself:
Currently, a lazy_select_query supports a distinct state, which can be either TRUE or FALSE (corresponding to a normal SELECT vs. a SELECT DISTINCT.
The basic idea of this PR is to add a third state to the distinct attribute which represents a list of columns that belong to the SELECT DISTICT ON (...) clause.
Dbplyr backends can opt in to make use of DISTINCT ON via implementing a method of the new generic supports_distinct_on() that returns TRUE.

Open issues

  1. A major issue is the handling of the order specification. DISTINCT ON uses ORDER BY to specify an ordering. That is also a reason why ORDER BY is allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids the ORDER BY statement in subqueries. I did not investigate yet, if that can be modified easily, or even can be changed at all. In any case, from the user-perspective window_order() would probably still be the right verb to specify the order.
  2. The syntax highlighting of the currently generated SQL code is incorrect, that's probably because sql_clause() is not used correctly.
  3. Having one distinct attribute that holds either TRUE, FALSE or a column list representation leads to a lot of required case distinction checks in the code, which are rather unpleasent to read and complicate the code. There is probably a way to model this in a more streamlined fashion.

I would appreciate feedback to the open issues as well as the already existing code. It might not be the right approach after all.

@hadley
Copy link
Member

hadley commented Nov 22, 2025

This seems like a reasonable approach to me. Are you still interested in working on this? If so, I can give you a detailed review. If not, I'm happy to finish it off.

@lschneiderbauer
Copy link
Author

This seems like a reasonable approach to me. Are you still interested in working on this? If so, I can give you a detailed review. If not, I'm happy to finish it off.

Both options are fine with me. Whatever is more convenient to you.

@hadley
Copy link
Member

hadley commented Nov 25, 2025

Are you interested in doing more PRs in the future? If so, I'm happy to invest the time in review 😀

@lschneiderbauer
Copy link
Author

In that case I would be happy about a review. :)

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review; I'll complete after rebasing so it's styled with air.

stopifnot(is_lazy_sql_part(order_by))
check_number_whole(limit, allow_infinite = TRUE, allow_null = TRUE)
check_bool(distinct)
# distinct = FALSE -> no distinct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great comment that helps me understand the intent of the rest of the code 😄

But it does illustrates a problem with lazy_select_query() — it isn't properly documented and thus it's hard to tell what the types of the inputs are. You don't need to fix it here, but it might be a useful follow up PR.

} else {
select <- new_lazy_select(select)
if (!is.logical(distinct)) {
distinct <- new_lazy_select(distinct)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this here and not in sql_build.lazy_select_query?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to sql_build.lazy_select_query() now. Originally I did this here because it seemed more consistent with the select variable.

#' @export
print.lazy_select_query <- function(x, ...) {
cat_line("<SQL SELECT", if (x$distinct) " DISTINCT", ">")
cat_line("<SQL SELECT", if (!isFALSE(x$distinct)) " DISTINCT",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to show the variables here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new line that prints the DISTINCT ON variables.

R/sql-clause.R Outdated

sql_distinct <-
if (isTRUE(distinct)) {
" DISTINCT"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our usual style would be to put sql_distinct <- in each branch or extract out a helper function. I'd probably lean towards a helper function since you could inline that in clause below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I have done now? Or should the helper function be inside sql_clause_select()?

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems solid to me overall — just lots of small style questions. Thanks again for working on this!

if (is_identity(syms(distinct), names(distinct), colnames(.data))) {
TRUE
} else {
syms(distinct)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that's worth checking? Seems like it would be pretty rare.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right. I removed the extra check.

@lschneiderbauer
Copy link
Author

The code seems solid to me overall — just lots of small style questions. Thanks again for working on this!

Thanks for the review!

The ordering issue is still open (from the MR description):

A major issue is the handling of the order specification. DISTINCT ON uses ORDER BY to specify an ordering. That is also a reason why ORDER BY is allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids the ORDER BY statement in subqueries. I did not investigate yet, if that can be modified easily, or even can be changed at all. In any case, from the user-perspective window_order() would probably still be the right verb to specify the order.

Do you have thoughts on that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants