Description
- What version of Python are you using?
Python 3.11.9 (main, Jun 24 2024, 14:49:51) [Clang 15.0.0 (clang-1500.3.9.4)]
- What operating system and processor architecture are you using?
macOS-15.3.1-arm64-arm-64bit
- What are the component versions in the environment (pip freeze)?
There are a lot, but the ones relevant to the test example are:
pandas==2.1.4
snowflake-snowpark-python==1.27.0
- What did you do?
I was trying to use subtract / minus / except_ and write some unit tests for my Snowpark code, but ran into some odd behavior. The toy example below illustrates the problem.
from snowflake.snowpark import Session
from datetime import date
session = Session.builder.config("local_testing", True).create()
df1 = session.create_dataframe([[1, 2], [3, 4]])
df2 = session.create_dataframe([[1, 1], [2, 2]])
df1.subtract(df2).show()
- What did you expect to see?
Expected:
---------------
|"_1" |"_2" |
---------------
|1 |2 |
|3 |4 |
---------------
Got:
---------------
|"_1" |"_2" |
---------------
|3 |4 |
---------------
As you can see, the row [1, 2] is getting filtered out even though it does not exist in the dataframe being subtracted. This happens because both 1 and 2 show up as values somewhere among the rows of df2. The bug is on this line of code: it checks whether all of the values in each row of df1 show up anywhere in df2, not whether they show up together in the same row. This is due to flattening all the df2 values together via cur_df.values.ravel(), so the row distinctions are lost.
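The difference between the two checks can be reproduced with plain pandas, without Snowpark. This is only a sketch of the behavior described above, not the actual local-testing source: the names flat_mask, row_mask, buggy, and fixed are made up for illustration, and the exact expression used in Snowpark is an assumption on my part beyond the cur_df.values.ravel() call.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=["_1", "_2"])
df2 = pd.DataFrame([[1, 1], [2, 2]], columns=["_1", "_2"])

# Flattened check (assumed shape of the buggy logic): ravel() turns df2
# into one flat array [1, 1, 2, 2], so a row of df1 is dropped whenever
# each of its values appears anywhere in df2.
flat_values = df2.values.ravel()
flat_mask = df1.isin(flat_values).all(axis=1)
buggy = df1[~flat_mask]

# Row-aware check: compare whole rows as tuples, so [1, 2] only matches
# a df2 row that is exactly (1, 2).
df2_rows = set(df2.itertuples(index=False, name=None))
row_mask = pd.Series(
    [row in df2_rows for row in df1.itertuples(index=False, name=None)],
    index=df1.index,
)
# MINUS / EXCEPT also eliminates duplicate rows from the result.
fixed = df1[~row_mask].drop_duplicates()

print(buggy.values.tolist())  # [[3, 4]]
print(fixed.values.tolist())  # [[1, 2], [3, 4]]
```

The row-aware version keeps [1, 2], matching what Snowflake returns for the equivalent query. It ignores NULL handling and column type coercion, which a real fix would need to consider.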
In Snowflake itself, an equivalent query does what you'd expect:
select * from (select * from values (1, 2), (3, 4)) minus (select * from values (1, 1), (2, 2));
COLUMN1 | COLUMN2
-- | --
3 | 4
1 | 2
- Can you set logging to DEBUG and collect the logs?
N/A