Description
Hello,
If you have a DataFrame with a large number of fields (over 100), then when the memorysafecursorfetcher gets the rows it triggers two calls to fetchmany. When the cursor is a spark_df_cursor, fetchmany is not deterministic, and because Spark queries are evaluated lazily, the DataFrame that comes back may not contain the desired rows.
def fetchmany(self, size: int) -> tuple[tuple]:
    rows = []
    self.rowcount = self.df.count()
    spark_rows: list[Row] = self.df.offset(self.cursor_index).limit(size).collect()
    self.cursor_index += len(spark_rows)
    for spark_row in spark_rows:
        row = self.convert_spark_row_to_dbapi_row(spark_row)
        rows.append(row)
    return tuple(rows)
The problem is that self.df is re-evaluated on every call, so there is a chance that when the offset is applied on the second call, the rows describing the fields come back in a different order than they did on the first call.
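As a minimal sketch of the failure mode (the table name here is made up, and this assumes a Spark version that supports DataFrame.offset, as the existing code does):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # DESCRIBE returns one row per field, with no guaranteed ordering.
    df = spark.sql("DESCRIBE TABLE my_wide_table")

    # Two separate actions, like the two fetchmany calls.
    page_1 = df.offset(0).limit(100).collect()
    page_2 = df.offset(100).limit(100).collect()

    # Each collect() re-runs the unordered query, so page_1 and page_2 are not
    # guaranteed to be disjoint or to cover every row exactly once.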
There are two possible solutions. The first is to rely on the fact that PySpark's DESCRIBE output on a table has known column names, and to order by one of them:
def sql_get_table_columns(
    self,
    table_name: str,
    included_columns: list[str] | None = None,
    excluded_columns: list[str] | None = None,
):
    return f"DESCRIBE {table_name} ORDER BY col_name"
The ORDER BY clause is the addition.
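For reference, Spark's DESCRIBE output has a fixed set of column names, which is what makes col_name usable as a sort key. A quick check (the table name is again made up):

    # The DESCRIBE result exposes these columns:
    spark.sql("DESCRIBE TABLE my_wide_table").columns
    # ['col_name', 'data_type', 'comment']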
The second is to add the orderBy in fetchmany itself:
def fetchmany(self, size: int) -> tuple[tuple]:
    rows = []
    self.rowcount = self.df.count()
    spark_rows: list[Row] = self.df.orderBy("col_name").offset(self.cursor_index).limit(size).collect()
    self.cursor_index += len(spark_rows)
    for spark_row in spark_rows:
        row = self.convert_spark_row_to_dbapi_row(spark_row)
        rows.append(row)
    return tuple(rows)
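With the ordering applied, paging over the same DataFrame becomes deterministic. A sketch of the expected behaviour, assuming the col_name values are unique (as they are for a plain column listing):

    df = spark.sql("DESCRIBE TABLE my_wide_table").orderBy("col_name")

    page_1 = df.offset(0).limit(100).collect()
    page_2 = df.offset(100).limit(100).collect()

    # With a total ordering on col_name, the two pages are disjoint and together
    # cover every row exactly once, even though each collect() re-runs the query.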
I believe this is the explanation for #2129, but I wasn't entirely sure, so I thought a separate issue would be best.