Hi, I see in this PR that streaming results for the Python client were implemented back in 2023 (duckdb/duckdb#6346). However, this doesn't seem to work with read_blob: even though the result object is streamed, the data for the full query result is loaded into memory on the first fetch (not sure whether this is expected or a bug). Given the following query (through the Python client):
xx = duckdb.query('''
SELECT *
FROM read_blob('.../**/data.data', hive_partitioning=true)
WHERE ...
''')
xx.fetchone()
When I do xx.fetchone() (or any other xx.fetch* method), it seems to load all the data for the full query into memory first, instead of just that row or chunk. (For the record, memory streaming does work properly with read_parquet.)
As a workaround for now, I can query with EXCLUDE content, iterate through the result in whatever chunk size I want, and then call read_blob with that chunk's filenames, this time including the content. But I'm wondering if there is a first-class, supported way to stream the results in memory.
(Not sure if this is a Python-client-specific issue or an underlying DuckDB engine issue.)
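For reference, here is a minimal sketch of that workaround. It assumes read_blob's filename column and its ability to take a list of paths; the chunk size and the per-row processing step are placeholders:

import duckdb

# List the files without their content; this result is small enough to materialize.
filenames = [
    row[0]
    for row in duckdb.query('''
        SELECT filename
        FROM read_blob('./ddb_stream_issue_repro/**/data.data', hive_partitioning=true)
    ''').fetchall()
]

# Re-read the blobs in small batches so only one chunk's content is in memory at a time.
CHUNK_SIZE = 1
for i in range(0, len(filenames), CHUNK_SIZE):
    chunk_files = filenames[i:i + CHUNK_SIZE]
    files_sql = ', '.join("'" + f.replace("'", "''") + "'" for f in chunk_files)
    rows = duckdb.query(
        f"SELECT * FROM read_blob([{files_sql}], hive_partitioning=true)"
    ).fetchall()
    for row in rows:
        pass  # process one file's content here, then let the chunk go out of scope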
Full repro (run on Google Colab):
!pip install --upgrade duckdb
import os
import psutil
import shutil
from pathlib import Path
import duckdb
def display_memory_used(msg):
    proc = psutil.Process(os.getpid())
    mem_info = proc.memory_info()
    print(f'Memory Usage ({msg}): {int(mem_info.rss * 1e-6)}MB')
filepaths = [
    './ddb_stream_issue_repro/a=1/b=1/data.data',
    './ddb_stream_issue_repro/a=1/b=2/data.data',
    './ddb_stream_issue_repro/a=2/b=1/data.data',
]
for filepath in filepaths:
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b'1' * 100_000_000)  # ~100 MB per file
display_memory_used('Before Query')
xx = duckdb.query('''
select *
from read_blob('./ddb_stream_issue_repro/**/data.data', hive_partitioning=true)
''')
display_memory_used('After Query, Before Fetchone')
xx.fetchone()
display_memory_used('After Fetchone')
shutil.rmtree('./ddb_stream_issue_repro')
And thank you for all the amazing work on DuckDB!