
Conversation

@Mikejmnez (Contributor) commented Aug 12, 2025

With this PR, the following is true:

import xarray as xr
from requests_cache import CachedSession

session = CachedSession(cache_name='debug')
session.cache.clear()

dap4urls = [
    "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc",
    "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc",
]

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=session,
                       concat_dim='TIME', parallel=True, combine='nested',
                       decode_times=False)

session.cache.urls()
>>>['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dmr']

So in DAP4, the dimensions are always batched together, i.e. downloaded within a single request.
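To see the batched constraint expression, you can percent-decode the dap4.ce query parameter from the first cached URL above (a quick check using only the Python standard library):

from urllib.parse import unquote
unquote('COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D')
>>> 'COADSX[0:1:179];COADSY[0:1:89];TIME[0:1:11]'  # all three dimensions in one request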

In addition, and to preserve the previous behavior, I added a backend argument batch=True | False. When batch=True, all non-dimension arrays are downloaded within a single response (ideal when streaming data to store locally).
When batch=False, which is the default, each non-dimension array is downloaded with its own http request, as before. This is ideal in many data-exploration scenarios.

cache_session = CachedSession(cache_name='debug')

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=cache_session,
                       parallel=True, combine='nested', concat_dim='TIME',
                       decode_times=False, batch=True)

len(cache_session.cache.urls())
>>> 4 # 1 dmr and 1 dap request per file (2 files)

# triggers all non-dimension data to be downloaded in a single http request
ds.load()

len(cache_session.cache.urls())
>>> 6 # the previous 4, plus one extra dap request per file

When batch=False (the default), the last step (ds.load()) instead triggers individual downloads, one per non-dimension variable.
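A minimal sketch of how one could verify this with the same cached session (the exact count depends on how many non-dimension variables each file contains):

urls_before = len(cache_session.cache.urls())
ds = xr.open_mfdataset(dap4urls, engine='pydap', session=cache_session,
                       parallel=True, combine='nested', concat_dim='TIME',
                       decode_times=False)  # batch defaults to False
ds.load()  # one http request per non-dimension variable per file
len(cache_session.cache.urls()) - urls_before  # number of requests triggered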

These changes allow a more performant download experience with xarray + pydap. However, most of these changes depend on a yet-to-be-released version of pydap (3.5.6). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code. Update: pydap 3.5.6 has been released!

@github-actions bot added the topic-backends, CI (Continuous Integration tools), dependencies (Pull requests that update a dependency file), and io labels Aug 12, 2025
@Mikejmnez Mikejmnez changed the title from "Pydap4 scale" to "[pydap backend] enables downloading/processing multiple arrays within single http request" Aug 12, 2025
@Mikejmnez Mikejmnez marked this pull request as ready for review August 13, 2025 07:11
@Mikejmnez (Contributor, Author) commented Aug 13, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reversed ordering in how the dimensions show up ((lat, lon) vs (lon, lat)). Not sure if this is a pydap/PydapDataStore issue. I am imposing sorted in the get_dimensions method of the PydapDataStore. The local tests ran fine (so nothing broke), but again, this failing test did not show up in my testing...

@shoyer (Member) left a comment:

Thanks @Mikejmnez !

@shoyer (Member) commented Aug 18, 2025

> hmm - the test I see that fails (sporadically) concerns the following assertion: [...] this failing test did not show up in my testing...

This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.

@shoyer (Member) commented Aug 18, 2025

I'm seeing the same error over here:
#10649

Not quite sure what to make of this, but seems to be a separate bug.

@Mikejmnez (Contributor, Author) commented Aug 18, 2025

Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)

@Mikejmnez (Contributor, Author) commented Sep 19, 2025

@shoyer I had a second go at this finally. Moved much of the logic to the backend.

Here is the current state of things:

  • This PR installs pydap from source. Why? I want to leave the door open for changes to the pydap backend that may arise from this PR, and include them in the next pydap release. Only when there is a general feeling that this PR is ready to be merged will I make a pydap release and revert to installing pydap from conda. More comments/requests for changes on this PR are welcome!
  • The failing test is unrelated to this PR, but I think I found the potential culprit in the dap4 metadata parser in pydap. I will spend today working on that; it needs to be fixed asap.

@Mikejmnez (Contributor, Author) commented Sep 26, 2025

@shoyer This is ready for further review.

Pydap has a new release that fixes some issues in the backend xml parser (there was a bug that got fixed). I think some additional work may be needed in the next couple of weeks, but that is unrelated to this PR anyway...

I did not know what to make of the Mypy failures, but they also occur on the main branch. Fixed in #10792.

@Mikejmnez Mikejmnez force-pushed the pydap4_scale branch 2 times, most recently from aac3163 to 4b516b4 Compare September 30, 2025 20:38
@Mikejmnez Mikejmnez requested a review from shoyer September 30, 2025 20:41
@Mikejmnez (Contributor, Author) commented Sep 30, 2025

@shoyer Let me know if there is any feedback, concerns, further reviewing, etc.

This PR enables a new (non-default) feature that was added to the pydap backend over the span of several months, namely the ability to download multiple variables within a single request, per the opendap spec. Without this feature, each variable is downloaded separately, which does not take advantage of the opendap protocol and can make pydap unusable when each remote file has more than ~2-3 variables and there are at least 10 urls to consolidate (for example via mds = xr.open_mfdataset(...) followed by mds.to_zarr(...)).

This PR also makes it so that, when accessing via the dap4 protocol, all dimensions are always downloaded within a single request by default. This is the most performant approach compared to downloading each dimension with a separate request, and again improves performance when "only opening" multiple remote files.

@Mikejmnez (Contributor, Author) commented:

@shoyer any further comments?

I'd be happy if at least some of the features within this PR are incorporated, especially the feature of always downloading all dimensions at once (i.e. a single dap url for all N dims instead of N dap urls for N dims) when dap4 is the protocol. That would make a significant performance difference. In that simple scenario, batch is no longer necessary as an extra argument, so no extra logic is needed from the user to get the performance gains.

In the general case (which this PR enables), the user needs to specify batch=True to ensure a "safe" approach. An "unsafe" scenario is when the remote file is a virtual aggregation of many (nc) files, often with an .ncml extension. In that scenario you want to download individual variables via individual dap requests...

@Mikejmnez Mikejmnez changed the title from "[pydap backend] enables downloading/processing multiple arrays within single http request" to "[pydap backend] enables downloading multiple dim arrays within single http request" Nov 12, 2025
Comment on lines +168 to +171
var.dataset.enable_batch_mode()
data_array = self._get_data_array(var)
data = indexing.LazilyIndexedArray(data_array)
var.dataset.disable_batch_mode()
@dcherian (Contributor) commented:

As an aside, a with var.dataset.batch_mode(): context manager would be a nice API for this.
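A minimal sketch of what such a context manager could look like, assuming only the enable_batch_mode/disable_batch_mode methods shown in the diff above (the name batch_mode and this implementation are illustrative, not part of this PR):

from contextlib import contextmanager

@contextmanager
def batch_mode(dataset):
    # illustrative sketch: wraps the enable/disable calls from the diff above
    dataset.enable_batch_mode()
    try:
        yield dataset
    finally:
        dataset.disable_batch_mode()  # always restore the non-batched state

# usage: batching stays enabled only inside the with-block
# with batch_mode(var.dataset):
#     data_array = self._get_data_array(var)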

@Mikejmnez (Contributor, Author) commented Nov 19, 2025:

Thanks @dcherian. That is a really good suggestion. Currently the enable_batch... method does not support the context manager protocol (it was never meant to be turned on/off). I totally see what you mean. I'll set it up (and come back to this in a later PR).

@dcherian (Contributor) commented:

So why are we turning it off here, then?

@Mikejmnez (Contributor, Author) commented Nov 19, 2025:

Short answer: it doesn't need it, but it is there to show that batching is only used for dap4 dimensions.
Long answer: the scope of this PR was originally broader, and it worked beyond dimensions:

Originally I enabled an optional parameter batch=None (default). So while dimensions were always downloaded within a single dap response (in dap4), there was the option to download the non-dimension variables in separate, individual dap responses (say, when executing ds.load()). With pure pydap there is no distinction between dims and non-dims, but xarray eagerly loads dims into memory, and so I split the logic this way.

I slimmed the PR down to do only dimensions, and the performance gain when using xr.open_mfdataset is enough that I would be pleased if this is merged. I'd more than gladly restore the batch=None | Iterable behavior, which further enables non-dimensions to get "batch downloaded" together for additional performance gains. The need for the optional parameter (as opposed to a default for dap4) relates to best/safe practices when the remote url points to a virtually aggregated dataset (for example .ncml). In that scenario, "batch downloading" should probably be avoided, and so this behavior needs to be optional and user-specified.
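A hypothetical sketch of the batch=None | Iterable usage described above (the Iterable form and the variable names are illustrative; only batch=True | False exists in this PR):

url = dap4urls[0]

# batch=None (default): each non-dimension variable gets its own dap request
ds = xr.open_dataset(url, engine='pydap', session=session)

# batch=<Iterable>: only the named non-dimension variables are downloaded together
ds = xr.open_dataset(url, engine='pydap', session=session, batch=['SST', 'AIRT'])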

@Mikejmnez (Contributor, Author) commented:

> Short answer: it doesn't need it, but it is there to show that batching is only used for dap4 dimensions.

I stand corrected - I ran some tests with ds.load(), and with the current behavior it does need to be disabled. I like the idea of using the context manager protocol, but it is not currently implemented.
Apologies - I have looked at this for so long that I am starting to get confused by the different iterations of this PR.

So this PR would either:

a) Stay as is: download only dims within a single dap url.
b) Incorporate the more general behavior (enable non-dims, the original purpose of this PR).

I have no urgency on this, and my preference would be b) if that is OK with you. I could implement/enable the context manager protocol to improve the API and include it in pydap's next release.

@dcherian (Contributor) commented:

Happy to defer to your judgement and preferences here :)

The context manager is just a nice-to-have and not a blocker. It does sound like there's already a nice improvement. I'm happy to merge as-is at the moment.

@shoyer (Member) commented:

Generally I recommend merging smaller, incremental changes whenever feasible. They are easier to review and get improvements out into the world faster.

(There is a separate question of whether the pydap backend should be split off from Xarray, given its growing complexity, but I'm also happy to defer that to another day.)

@Mikejmnez (Contributor, Author) commented:

> Generally I recommend merging smaller, incremental changes whenever feasible.

Yeah - let's merge as-is. This is ready, and it will be nice to get something out into the world right now. It's been waaay too long...

> (There is a separate question of whether the pydap backend should be split off from Xarray, given its growing complexity, but I'm also happy to defer that to another day.)

I can see the "growing complexity" argument from the maintainers'/developers' side. Definitely a question for another day. I would be happy to be part of that conversation.

@dcherian (Contributor) commented:

Thanks @Mikejmnez !

@dcherian dcherian enabled auto-merge (squash) November 21, 2025 15:28
@dcherian dcherian merged commit a8efe17 into pydata:main Nov 21, 2025
36 checks passed

Labels

CI (Continuous Integration tools), dependencies (Pull requests that update a dependency file), io, topic-backends


Development

Successfully merging this pull request may close these issues.

make pydap backend more opendap-like by downloading multiple variables in same http request
