Conversation

@williebsweet (Contributor) commented Jul 30, 2025

Only applies to the %%fetchdf magic command (although it could be expanded).

Replaces the generic pandas.read_sql_query() with PandasCursor for improved I/O performance.


With pandas.read_sql_query():

  1. Query executes in Athena and results are written to S3 as CSV files
  2. PyAthena connection fetches results row by row through the standard DB-API cursor interface
  3. Data comes through the AWS API using fetchall() or similar methods
  4. pandas constructs the DataFrame from the row-by-row data

The key bottleneck is steps 2 and 3: the data has to pass through the PyAthena cursor's fetchall() method, which retrieves results row by row via AWS API calls.
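For illustration, the generic path looks like this, with sqlite3 standing in for the PyAthena connection (both expose the DB-API interface that pandas.read_sql_query() drives row by row):

```python
import sqlite3

import pandas as pd

# sqlite3 stands in for a pyathena.connect() connection; both expose
# the DB-API interface that pandas.read_sql_query() works against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

# pandas pulls rows through the cursor's fetch* methods, then builds
# the DataFrame from those Python row tuples.
df = pd.read_sql_query("SELECT id, name FROM t ORDER BY id", conn)
```

Against Athena, every one of those fetches becomes an AWS API round trip, which is where the time goes.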

With PandasCursor:

  1. Query executes in Athena and results are written to S3 as CSV files
  2. PandasCursor directly downloads the CSV file from S3
  3. pandas loads the CSV directly using optimized CSV parsing

The key advantage is step 2: Instead of going through AWS APIs row-by-row, PandasCursor downloads the entire result CSV file directly from S3 and then uses pandas' highly optimized CSV reading capabilities.
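The fast path is what this PR wires up via pyathena's PandasCursor. A minimal connection sketch, where the staging bucket, region, and table name are placeholders you would substitute for your own:

```python
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

# Placeholder staging bucket and region; substitute your own.
cursor = connect(
    s3_staging_dir="s3://your-athena-results-bucket/path/",
    region_name="us-east-1",
    cursor_class=PandasCursor,
).cursor()

# PandasCursor downloads the result CSV directly from S3 and parses it
# with pandas' optimized CSV reader, skipping row-by-row API fetches.
df = cursor.execute("SELECT * FROM your_table LIMIT 10").as_pandas()
```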


Anecdotally, queries that were taking 30+ minutes to execute now take ~2 minutes.

@erindru (Collaborator) commented Jul 30, 2025

Oh, nice, Athena has some kind of "native" DataFrame provided by the library?

In that case, _fetch_native_df() should be implemented on AthenaEngineAdapter to return a pd.DataFrame that is backed by PandasCursor so that all fetchdf() calls can use it.

If it relies on an optional extra being available, and that extra is not available, then it can fall back to the existing logic from PandasNativeFetchDFSupportMixin.
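A rough sketch of that fallback pattern (the function name and structure here are illustrative, not the actual SQLMesh adapter code):

```python
import pandas as pd

# Optional extra: pyathena[pandas] provides PandasCursor.
try:
    from pyathena.pandas.cursor import PandasCursor
except ImportError:
    PandasCursor = None  # extra not installed; use the generic path


def fetch_native_df(connection, sql: str) -> pd.DataFrame:
    """Prefer the S3-download fast path; fall back to DB-API fetches."""
    if PandasCursor is not None:
        # Fast path: download the result CSV straight from S3 and
        # parse it with pandas' optimized CSV reader.
        return connection.cursor(PandasCursor).execute(sql).as_pandas()
    # Fallback: the existing row-by-row DB-API logic, as in
    # PandasNativeFetchDFSupportMixin.
    return pd.read_sql_query(sql, connection)
```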

@williebsweet (Contributor, Author) commented
@erindru Yeah - my testing of moving it to engine_adapter/athena.py got derailed by the Athena issue mentioned here.

I'm having trouble getting pre-commit to run. The docs, Makefile, CI commands, and environment don't seem to be in sync. Is uv the default now?

@erindru (Collaborator) commented Jul 31, 2025

Nope, uv has never been the default, it just happens to be what that user was using.

Generally what we do is something like:

$ python -m venv sqlmesh-env
$ . ./sqlmesh-env/bin/activate
(sqlmesh-env) $ make install-dev
(sqlmesh-env) $ make install-pre-commit  # only have to do this once
# make some changes
(sqlmesh-env) $ git commit ....

The pre-commit hooks should run on git commit and abort the commit if they fail. You should also be able to run make style for something similar. The venv in which you ran make install-dev needs to be active for all of these.

Is there a specific error you're getting?

@williebsweet (Contributor, Author) commented
@erindru I was able to resolve my issue by not using uv.

Weirdly, though, make style and make py-style both pass locally, but the latter failed in the last CI run.
