Hi,
I’ve encountered an issue with the missing_percent check when a filter excludes all rows from the dataset. In this scenario, the check unexpectedly fails.
Here’s a minimal reproducible example:
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("SodaScanTest").getOrCreate()

data = [
    (1, "Alice", 29),
    (2, "Bob", 25),
    (3, "Charlie", None),
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("people")

scan = Scan()
scan.set_scan_definition_name("soda_scan_test")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark)
scan.set_verbose(True)
scan.add_sodacl_yaml_str("""
checks for people:
  - missing_percent(age):
      fail: when < 100
      filter: name = 'Diana'
""")
scan.execute()

if scan.has_check_fails():
    print(scan.get_logs_text())
    print("Scan failed!")
else:
    print("Scan succeeded!")

spark.stop()
Observed output:
INFO | 1/1 check FAILED:
INFO | people in spark_df
INFO | missing_percent(age) fail when < 100 [FAILED]
INFO | check_value: 0.0
INFO | row_count: 0
INFO | missing_count: 0
Scan failed!
Expected behavior:
If the filter excludes all rows (i.e., row_count: 0), I would expect the check to pass, since there are no non-missing values for the filtered rows. Failing the check in this case seems unintuitive.
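To make the expectation concrete, here is a minimal sketch in plain Python (not Soda's actual implementation; the function name and parameters are hypothetical) of how I would expect the fail: when < 100 condition to be evaluated when the filter leaves zero rows:

def should_fail_missing_percent(missing_count: int, row_count: int, threshold: float = 100.0) -> bool:
    # Hypothetical illustration, not Soda's code: evaluate "fail: when < threshold".
    if row_count == 0:
        # The filter excluded every row, so there is nothing to validate -> pass.
        return False
    missing_percent = missing_count / row_count * 100
    return missing_percent < threshold

# Filtered example above (name = 'Diana' matches no rows):
print(should_fail_missing_percent(missing_count=0, row_count=0))  # False -> expected: pass
# Unfiltered dataset (1 missing age out of 3 rows):
print(should_fail_missing_percent(missing_count=1, row_count=3))  # True -> fail, as expected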
Is this the intended behavior? If not, could the check be adjusted to pass when all rows are filtered out?
Thank you!