-
Notifications
You must be signed in to change notification settings - Fork 185
add support for DISTINCT ON
#1620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add support for DISTINCT ON
#1620
Conversation
|
This seems like a reasonable approach to me. Are you still interested in working on this? If so, I can give you a detailed review. If not, I'm happy to finish it off. |
Both options are fine with me. Whatever is more convenient to you. |
|
Are you interested in doing more PRs in the future? If so, I'm happy to invest the time in review 😀 |
|
In that case I would be happy about a review. :) |
hadley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review; I'll complete after rebasing so it's styled with air.
| stopifnot(is_lazy_sql_part(order_by)) | ||
| check_number_whole(limit, allow_infinite = TRUE, allow_null = TRUE) | ||
| check_bool(distinct) | ||
| # distinct = FALSE -> no distinct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great comment that helps me understand the intent of the rest of the code 😄
But it does illustrates a problem with lazy_select_query() — it isn't properly documented and thus it's hard to tell what the types of the inputs are. You don't need to fix it here, but it might be a useful follow up PR.
R/lazy-select-query.R
Outdated
| } else { | ||
| select <- new_lazy_select(select) | ||
| if (!is.logical(distinct)) { | ||
| distinct <- new_lazy_select(distinct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this here and not in sql_build.lazy_select_query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to sql_build.lazy_select_query() now. Originally I did this here because it seemed more consistent with the select variable.
R/lazy-select-query.R
Outdated
| #' @export | ||
| print.lazy_select_query <- function(x, ...) { | ||
| cat_line("<SQL SELECT", if (x$distinct) " DISTINCT", ">") | ||
| cat_line("<SQL SELECT", if (!isFALSE(x$distinct)) " DISTINCT", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to show the variables here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new line that prints the DISTINCT ON variables.
R/sql-clause.R
Outdated
|
|
||
| sql_distinct <- | ||
| if (isTRUE(distinct)) { | ||
| " DISTINCT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our usual style would be to put sql_distinct <- in each branch or extract out a helper function. I'd probably lean towards a helper function since you could inline that in clause below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I have done now? Or should the helper function be inside sql_clause_select()?
hadley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code seems solid to me overall — just lots of small style questions. Thanks again for working on this!
R/verb-distinct.R
Outdated
| if (is_identity(syms(distinct), names(distinct), colnames(.data))) { | ||
| TRUE | ||
| } else { | ||
| syms(distinct) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that's worth checking? Seems like it would be pretty rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right. I removed the extra check.
Thanks for the review! The ordering issue is still open (from the MR description):
Do you have thoughts on that? |
This PR is a first draft to address duckdb/duckdb-r#384, and add support for the usage of "DISTINCT ON" when using
distinct(..., .keep_all = TRUE)which is a SQL variant supported by PostgreSQL, and DuckDB, see e.g. https://duckdb.org/docs/stable/sql/query_syntax/select.html#distinct-on-clause. Currently,.keep_all = TRUEis implemented using window functions. UsingDISTINCT ONinstead promises a performance boost of that operation.I came to the conclusion that this cannot be addressed within an external dbplyr backend, but it requires a minor modification of the
lazy_select_querydata structure itself:Currently, a
lazy_select_querysupports adistinctstate, which can be eitherTRUEorFALSE(corresponding to a normalSELECTvs. aSELECT DISTINCT.The basic idea of this PR is to add a third state to the
distinctattribute which represents a list of columns that belong to theSELECT DISTICT ON (...)clause.Dbplyr backends can opt in to make use of
DISTINCT ONvia implementing a method of the new genericsupports_distinct_on()that returnsTRUE.Open issues
DISTINCT ONusesORDER BYto specify an ordering. That is also a reason whyORDER BYis allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids theORDER BYstatement in subqueries. I did not investigate yet, if that can be modified easily, or even can be changed at all. In any case, from the user-perspectivewindow_order()would probably still be the right verb to specify the order.sql_clause()is not used correctly.distinctattribute that holds eitherTRUE,FALSEor a column list representation leads to a lot of required case distinction checks in the code, which are rather unpleasent to read and complicate the code. There is probably a way to model this in a more streamlined fashion.I would appreciate feedback to the open issues as well as the already existing code. It might not be the right approach after all.