Use PandasCursor for Athena dataframes in fetchdf magic #5072
Conversation
Oh, nice, Athena has some kind of "native" DataFrame provided by the library? In that case, _fetch_native_df() should be implemented on If it relies on an optional extra being available, and that extra is not available, then it can fall back to the existing/current logic from
Nope, uv has never been the default, it just happens to be what that user was using. Generally what we do is something like:
The pre-commit hooks should run on Is there a specific error you're getting?
@erindru I was able to resolve my issue by not using But weirdly,
Only applies to the %%fetchdf magic command (although it could be expanded).
Replaces the generic pandas.read_sql_query() with PandasCursor for improved I/O performance.
With pandas.read_sql_query():
The key bottleneck is steps 2-3: the data has to go through the PyAthena cursor's fetchall() method, which retrieves results row-by-row via AWS API calls.
With PandasCursor:
The key advantage is step 2: instead of going through AWS APIs row-by-row, PandasCursor downloads the entire result CSV file directly from S3 and then uses pandas' highly optimized CSV reading capabilities.
Anecdotally, I had queries that were taking 30+ min to execute that are now taking ~2 min.
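For illustration, here is a minimal sketch of the PandasCursor path combined with the fallback suggested in the review comments. It is not the code from this PR: the function name fetch_df_with_fallback is hypothetical, and it assumes the connection object is a PyAthena connection. It relies on PyAthena's documented Connection.cursor(PandasCursor) and cursor.execute(...).as_pandas() calls, and drops back to pandas.read_sql_query() when the optional pandas extra cannot be imported.

```python
def fetch_df_with_fallback(connection, sql):
    """Return the result of `sql` as a DataFrame.

    Hypothetical helper for illustration only. Assumes `connection`
    is a PyAthena connection (or anything read_sql_query accepts
    when the fallback path is taken).
    """
    try:
        # Only importable when PyAthena's pandas support is installed.
        from pyathena.pandas.cursor import PandasCursor
    except ImportError:
        # Fallback: the generic path that goes through the DB-API cursor.
        import pandas as pd

        return pd.read_sql_query(sql, connection)

    # Native path: ask the connection for a PandasCursor and let it
    # build the DataFrame from the query results itself.
    cursor = connection.cursor(PandasCursor)
    return cursor.execute(sql).as_pandas()
```

Keeping the import inside the function means the optional dependency is only checked when a DataFrame is actually requested, so users without the extra installed still get the existing behaviour.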