
missing_percent unexpected output after filtering all rows #2407

@migueldoblado

Description

Hi,

I’ve encountered an issue with the missing_percent check when a filter excludes all rows from the dataset. In this scenario, the check unexpectedly fails.

Here’s a minimal reproducible example:

from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("SodaScanTest").getOrCreate()

data = [
    (1, "Alice", 29),
    (2, "Bob", 25),
    (3, "Charlie", None),
]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("people")

scan = Scan()
scan.set_scan_definition_name("soda_scan_test")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark)
scan.set_verbose(True)

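# The check-level filter below ("name = 'Diana'") matches no rows in the
# sample data, so the filtered row_count is 0.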
scan.add_sodacl_yaml_str("""
checks for people:
  - missing_percent(age):
      fail: when < 100
      filter: name = 'Diana'
""")

scan.execute()

if scan.has_check_fails():
    print(scan.get_logs_text())
    print("Scan failed!")
else:
    print("Scan succeeded!")

spark.stop()

Observed output:

INFO | 1/1 check FAILED:
INFO | people in spark_df
INFO | missing_percent(age) fail when < 100 [FAILED]
INFO | check_value: 0.0
INFO | row_count: 0
INFO | missing_count: 0
Scan failed!

Expected behavior:
If the filter excludes every row (row_count: 0), I would expect the check to pass: with no rows left to evaluate, missing_percent is effectively undefined (0 missing values out of 0 rows), yet it is reported as 0.0, which satisfies the fail condition (when < 100). Failing the check in this case seems unintuitive.

Is this the intended behavior? If not, could the check be adjusted to pass when all rows are filtered out?
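
For illustration only, here is a minimal sketch of the kind of zero-row guard I have in mind. The function below is hypothetical and is not Soda's actual implementation; it just shows the convention I would expect when the filter leaves no rows:

def missing_percent_check_value(missing_count: int, row_count: int):
    # Hypothetical helper, not Soda internals: when the check filter removes
    # every row there is nothing to evaluate, so report the metric as
    # undefined (None) and let the check pass / be skipped instead of
    # defaulting to 0.0, which currently trips "fail: when < 100".
    if row_count == 0:
        return None
    return 100.0 * missing_count / row_count

With that convention, the example above (missing_count: 0, row_count: 0) would be treated as passed or not evaluated rather than FAILED.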

Thank you!
