HTTP .parquet reference files mis-interpreted as JSON files due to isfile check in ReferenceFileSystem #1919

@capetienne

I want to report a bug in fsspec / kerchunk in how .parquet reference files served over HTTP are handled. HTTP .parquet refs get misinterpreted as JSON refs because of the isfile(...) check in ReferenceFileSystem.

Reproduction / Example

There is a .parquet reference file available via HTTP, e.g.
https://data-taos.ifremer.fr/kerchunk/copernicus_NWSHELF_2022.parquet/

When calling xr.open_dataset(uri, engine="kerchunk"), fsspec internally runs the following code in fsspec/implementations/reference.py:

ref_fs, fo2 = fsspec.core.url_to_fs(fo, **dic)
if ref_fs.isfile(fo2):
    # text JSON
    with fsspec.open(fo, "rb", **dic) as f:
        logger.info("Read reference from URL %s", fo)
        text = json.load(f)
    self._process_references(text, template_overrides)
else:
    # Lazy parquet refs
    logger.info("Open lazy reference dict from URL %s", fo)
    self.references = LazyReferenceMapper(
        fo2,
        fs=ref_fs,
        cache_size=cache_size,
    )

Because ref_fs.isfile(fo2) returns True for that HTTP URI (since the server responds with 200 OK), execution enters the JSON branch. But the content returned is not a JSON reference dict. It is a directory listing or an HTML index page, not JSON and not the actual parquet data. This leads to a JSON parsing error (e.g. JSONDecodeError):

File /srv/conda/envs/notebook/lib/python3.12/site-packages/fsspec/implementations/reference.py:705, in ReferenceFileSystem.__init__(self, fo, target, ref_storage_args, target_protocol, target_options, remote_protocol, remote_options, fs, template_overrides, simple_templates, max_gap, max_block, cache_size, **kwargs)
    703     with fsspec.open(fo, "rb", **dic) as f:
    704         logger.info("Read reference from URL %s", fo)
--> 705         text = json.load(f)
    706     self._process_references(text, template_overrides)
    707 else:
    708     # Lazy parquet refs

JSONDecodeError: Expected object or value

To be clear, the same .parquet reference opens correctly from the local filesystem, so it is not corrupted.
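
One possible direction, offered only as a sketch: choose the branch from the URL path rather than from isfile(), since an HTTP server can answer 200 OK for a directory. The helper name looks_like_parquet_refs and the suffix list below are hypothetical and not part of fsspec. The sketch assumes parquet reference stores are named with a .parquet or .parq suffix, which will not hold for every deployment.

```python
from urllib.parse import urlparse

# Hypothetical suffix list (an assumption, not an fsspec convention).
PARQUET_SUFFIXES = (".parquet", ".parq")


def looks_like_parquet_refs(url: str) -> bool:
    """Guess whether a reference URL points at a parquet reference store.

    Hypothetical helper, not part of fsspec: it inspects only the URL path,
    ignoring any trailing slash, and never contacts the server.
    """
    path = urlparse(url).path.rstrip("/").lower()
    return path.endswith(PARQUET_SUFFIXES)


if __name__ == "__main__":
    uri = "https://data-taos.ifremer.fr/kerchunk/copernicus_NWSHELF_2022.parquet/"
    print(looks_like_parquet_refs(uri))
    print(looks_like_parquet_refs("https://example.com/refs/combined.json"))
```

A check like this could run before the isfile() test, so that HTTP URLs ending in a parquet suffix go to the lazy parquet branch. Whether that is the right fix is for the maintainers to decide.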
