Description
I want to report a bug in fsspec / kerchunk related to how .parquet reference files served over HTTP are handled. The issue is that HTTP .parquet refs get misinterpreted as JSON refs, because of the isfile(...) logic in ReferenceFileSystem.
Reproduction / Example
There is a .parquet reference file available via HTTP, e.g.
https://data-taos.ifremer.fr/kerchunk/copernicus_NWSHELF_2022.parquet/
Calling xr.open_dataset(uri, engine="kerchunk") ends up in the following code in fsspec/implementations/reference.py:
    ref_fs, fo2 = fsspec.core.url_to_fs(fo, **dic)
    if ref_fs.isfile(fo2):
        # text JSON
        with fsspec.open(fo, "rb", **dic) as f:
            logger.info("Read reference from URL %s", fo)
            text = json.load(f)
        self._process_references(text, template_overrides)
    else:
        # Lazy parquet refs
        logger.info("Open lazy reference dict from URL %s", fo)
        self.references = LazyReferenceMapper(
            fo2,
            fs=ref_fs,
            cache_size=cache_size,
        )
Because ref_fs.isfile(fo2) returns True for that HTTP URI (since the server responds with 200 OK), execution enters the JSON branch. But the response body is not a JSON reference dict: it is either a directory listing or an HTML index page, not parquet reference data. json.load therefore fails with a JSONDecodeError:
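The failure mode can be reproduced offline with a minimal sketch. It assumes the server answers the directory-style URL with an HTML page; the `html_index` body and the `try_parse_refs` helper are made up for illustration and are not part of fsspec.

```python
import io
import json

# Hypothetical body, standing in for whatever page the server returns
# for the directory-style URL (assumption: it is HTML, not JSON).
html_index = b"<html><body>Index of /kerchunk/</body></html>"


def try_parse_refs(body: bytes):
    """Return the parsed reference dict, or None if the body is not JSON."""
    try:
        # Same call pattern as the JSON branch: json.load on a binary file object.
        return json.load(io.BytesIO(body))
    except json.JSONDecodeError:
        return None


# An HTML page cannot be parsed as a JSON reference dict ...
html_result = try_parse_refs(html_index)
# ... while a real JSON reference file parses fine.
json_result = try_parse_refs(b'{"version": 1, "refs": {}}')
```

Here `html_result` is `None` and `json_result` is a dict, which matches the behaviour described above: the JSON branch only works when the URL really points at a JSON file.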
    File /srv/conda/envs/notebook/lib/python3.12/site-packages/fsspec/implementations/reference.py:705, in ReferenceFileSystem.__init__(self, fo, target, ref_storage_args, target_protocol, target_options, remote_protocol, remote_options, fs, template_overrides, simple_templates, max_gap, max_block, cache_size, **kwargs)
        703     with fsspec.open(fo, "rb", **dic) as f:
        704         logger.info("Read reference from URL %s", fo)
    --> 705         text = json.load(f)
        706     self._process_references(text, template_overrides)
        707 else:
        708     # Lazy parquet refs
    JSONDecodeError: Expected object or value
To be clear, the same .parquet reference store opens correctly from a local copy, so it is not corrupted.
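One possible direction, offered as a sketch and not as fsspec's actual behaviour or fix: choose between the JSON and lazy-parquet branches from the URL itself instead of relying on isfile(), which is not informative over HTTP. The helper name `is_lazy_parquet_ref` is hypothetical, and the suffix convention (.parq / .parquet) is an assumption.

```python
from urllib.parse import urlsplit


def is_lazy_parquet_ref(url: str) -> bool:
    """Guess whether a reference URL points at a parquet reference store.

    Assumption: parquet reference stores are recognisable by a
    ".parq" or ".parquet" suffix, with or without a trailing slash.
    """
    # Ignore the query string and fragment, then drop any trailing slash.
    path = urlsplit(url).path.rstrip("/")
    return path.endswith((".parq", ".parquet"))


uri = "https://data-taos.ifremer.fr/kerchunk/copernicus_NWSHELF_2022.parquet/"
use_lazy_branch = is_lazy_parquet_ref(uri)
```

With such a check, the URL from this report would take the lazy parquet branch regardless of what isfile() returns, while a URL ending in .json would still be read as text JSON.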