
SNOW-1984396: Snowpark Local Testing minus filters out rows that match values across multiple rows in the subtracted set #3163

@matt-comity

Description

  1. What version of Python are you using?

Python 3.11.9 (main, Jun 24 2024, 14:49:51) [Clang 15.0.0 (clang-1500.3.9.4)]

  2. What operating system and processor architecture are you using?

macOS-15.3.1-arm64-arm-64bit

  3. What are the component versions in the environment (pip freeze)?

    There are many packages, but these are the ones relevant to the test example:

pandas==2.1.4
snowflake-snowpark-python==1.27.0

  4. What did you do?

I was writing unit tests for Snowpark code that uses subtract / minus / except_ and ran into unexpected behavior in local testing. The toy example below illustrates the problem.

from snowflake.snowpark import Session

# Local testing mode: no Snowflake connection is needed.
session = Session.builder.config("local_testing", True).create()
df1 = session.create_dataframe([[1, 2], [3, 4]])
df2 = session.create_dataframe([[1, 1], [2, 2]])
# No row of df1 appears in df2, so both rows should survive the subtraction.
df1.subtract(df2).show()

  5. What did you expect to see?

Expected:

---------------
|"_1"  |"_2"  |
---------------
|1     |2     |
|3     |4     |
---------------

Got:

---------------
|"_1"  |"_2"  |
---------------
|3     |4     |
---------------

As you can see, the row [1, 2] is filtered out even though it does not exist in the dataframe being subtracted. This happens because both 1 and 2 appear as values somewhere in df2, just not in the same row. The bug is on this line of code: it checks whether every value in each df1 row appears somewhere in df2, not whether the whole row matches a single df2 row. The df2 values are all flattened together via cur_df.values.ravel(), so the row boundaries are lost.
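To make the difference concrete, here is a minimal sketch in plain Python. It is not the Snowpark implementation: the lists of tuples stand in for the two dataframes, and the function names `subtract_flattened` and `subtract_rowwise` are hypothetical. The first function mimics the effect of flattening df2 into one pool of values, and the second compares whole rows, which is what MINUS is expected to do. NULL handling is deliberately left out.

```python
from itertools import chain

# Stand-ins for df1 and df2 from the example above (one tuple per row).
df1_rows = [(1, 2), (3, 4)]
df2_rows = [(1, 1), (2, 2)]


def subtract_flattened(left, right):
    """Assumed buggy pattern: membership is tested against a flat pool of values."""
    # Flattening discards row boundaries, similar in effect to values.ravel().
    pool = set(chain.from_iterable(right))
    # A row is dropped when every one of its values appears anywhere in the pool.
    return [row for row in left if not all(value in pool for value in row)]


def subtract_rowwise(left, right):
    """Row-wise set difference: a row is dropped only if the whole row is in right."""
    right_rows = set(right)
    result = []
    for row in left:
        # MINUS also de-duplicates its output, so skip rows already emitted.
        if row not in right_rows and row not in result:
            result.append(row)
    return result


print(subtract_flattened(df1_rows, df2_rows))  # [(3, 4)]
print(subtract_rowwise(df1_rows, df2_rows))    # [(1, 2), (3, 4)]
```

The flattened version drops (1, 2) because 1 and 2 each occur somewhere in df2, which matches the "Got" output above. The row-wise version keeps it, which matches the "Expected" output.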

In Snowflake itself, an equivalent query does what you'd expect:

select * from (select * from values (1, 2), (3, 4)) minus (select * from values (1, 1), (2, 2));
COLUMN1 | COLUMN2
--      | --
3       | 4
1       | 2

  6. Can you set logging to DEBUG and collect the logs?

N/A

Labels: bug (Something isn't working), status-triage_done (Initial triage done, will be further handled by the driver team)
