Skip to content

bug: count(*) and row_number() of different tables may not be distinct #860

@wangrunji0408

Description

@wangrunji0408
 create table t(a int);
 insert into t values (1), (2), (3);

 select c2, c1 from
     (select count(*) as c1 from t where a = 1),
     (select count(*) as c2 from t);
  expect:
+ 3 1
  actual:
- 1 3

 select c2, c1 from
     (select row_number() over () as c1 from t where a = 1),
     (select row_number() over () as c2 from t);
  expect:
+ 1 1
+ 2 1
+ 3 1
  actual:
- 1 1
- 1 2
- 1 3

The reason is that we use the expression itself as the output schema of a plan.

For both:
    (select count(*) as c1 from t where a = 1)
    (select count(*) as c2 from t)
Their schema is the same:
    [count(*)]

Similarly, for both:
    (select row_number() over () as c1 from t where a = 1)
    (select row_number() over () as c2 from t)
Their schema is the same:
    [row_number() over ()]

In the e-graph of the query, the same expression always has the same Id. So from the projection operator's view, it can not distinguish between c1 and c2 from the two subqueries.

To resolve this issue, the output columns should be associated with the plan that they come from.

Assume the plan Ids are:
23: (select count(*) as c1 from t where a = 1)
42: (select count(*) as c2 from t)
Then their schema should be:
    [count(*)_23]
    [count(*)_42]
Or use an global incremental counter:
    [count(*)_1]
    [count(*)_2]
Personally I prefer the first solution.
Because when we happen to have two count(*) from the same table, they can be deduplicated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions