Hi, I see in this PR that streaming results for the Python client were implemented back in 2023 (duckdb/duckdb#6346). However, this doesn't seem to work with read_blob: even though the result object is streamed, the data for the full query result is loaded into memory on the first fetch (not sure whether this is expected or a bug). Given the following query (through the Python client):
xx = duckdb.query('''
SELECT *
FROM read_blob('.../**/data.data', hive_partitioning=true)
WHERE ...
''')
xx.fetchone()
When I do xx.fetchone() (or any other xx.fetch* method), it seems to load all the data for the full query into memory first, instead of just that row or chunk. (For the record, memory streaming does work properly with read_parquet.)
As a workaround for now, I can query with EXCLUDE content, iterate through the result in whatever chunk size I want, and then call read_blob with that chunk's filenames, this time including the content. But I'm wondering if there is a first-class, supported way to stream the results in memory.
(Not sure if this is a Python-client-specific issue or an underlying DuckDB engine issue.)
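For reference, here is a minimal sketch of that workaround. It assumes read_blob's filename column and its ability to take a list of paths; the chunk size and the per-row processing step are placeholders:

import duckdb

# List the files without their content; this result is small enough to materialize.
filenames = [
    row[0]
    for row in duckdb.query('''
        SELECT filename
        FROM read_blob('./ddb_stream_issue_repro/**/data.data', hive_partitioning=true)
    ''').fetchall()
]

# Re-read the blobs in small batches so only one chunk's content is in memory at a time.
CHUNK_SIZE = 1
for i in range(0, len(filenames), CHUNK_SIZE):
    chunk_files = filenames[i:i + CHUNK_SIZE]
    files_sql = ', '.join("'" + f.replace("'", "''") + "'" for f in chunk_files)
    rows = duckdb.query(
        f"SELECT * FROM read_blob([{files_sql}], hive_partitioning=true)"
    ).fetchall()
    for row in rows:
        pass  # process one file's content here, then let the chunk go out of scope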
Full repro (run on Google Colab):
!pip install --upgrade duckdb
import os
import psutil
import shutil
from pathlib import Path
import duckdb
def display_memory_used(msg):
    proc = psutil.Process(os.getpid())
    mem_info = proc.memory_info()
    print(f'Memory Usage ({msg}): {int(mem_info.rss * 1e-6)}MB')
filepaths = [
    './ddb_stream_issue_repro/a=1/b=1/data.data',
    './ddb_stream_issue_repro/a=1/b=2/data.data',
    './ddb_stream_issue_repro/a=2/b=1/data.data',
]
for filepath in filepaths:
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b'1' * 100_000_000)  # ~100 MB per file
display_memory_used('Before Query')
xx = duckdb.query('''
select *
from read_blob('./ddb_stream_issue_repro/**/data.data', hive_partitioning=true)
''')
display_memory_used('After Query, Before Fetchone')
xx.fetchone()
display_memory_used('After Fetchone')
shutil.rmtree('./ddb_stream_issue_repro')
And thank you for all the amazing work on DuckDB!