[pydap backend] enables downloading multiple dim arrays within single http request #10629
hmm - the test I see that fails (sporadically) concerns the following assertion: where the groups have reverse ordering in the way dimensions show up ( |
shoyer
left a comment
Thanks @Mikejmnez !
This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.
I'm seeing the same error over here: Not quite sure what to make of this, but seems to be a separate bug.
Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)
@shoyer I had a second go at this finally. Moved much of the logic to the backend. Here is the current state of things:
@shoyer This is ready for further reviewing. Pydap has a new release that fixes a bug in the backend xml parser. I think there may be some additional work needed in the next couple of weeks, but that is unrelated to this PR anyways...
@shoyer Let me know if there is any feedback, concerns, further reviewing, etc. This PR enables a new (non-default) feature that was added to the pydap backend over the span of several months, namely the ability to download multiple variables within a single request, according to the opendap spec. Without this feature, each variable is downloaded separately, which does not take advantage of the opendap protocol and can make pydap unusable when each remote file has more than 2-3 variables and there are more than 10 urls to consolidate (for example via mds = xr.open_mfdataset and then mds.to_zarr or something). This PR also makes it so that when accessing via the dap4 protocol, all dimensions are always downloaded within a single request by default. This is more performant than downloading each dimension using a separate request. This again improves performance when "only opening" multiple remote files.
@shoyer any further comments? I'd be happy if at least some of the features within this PR are incorporated, especially the feature of always downloading all dimensions at once (i.e. a single dap url for all N dims instead of N dap urls for N dims), when In the general case (which this PR enables), the user needs to specify
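To make the "single dap url for all N dims instead of N dap urls" point concrete, here is a minimal sketch of how the request count changes. It assumes the DAP4 convention of a `dap4.ce` constraint expression with semicolon-separated variable paths; the server URL and the helper names `per_dimension_urls` and `batched_url` are hypothetical, and the URLs pydap actually builds may differ in encoding or extra parameters.

```python
def per_dimension_urls(base_url, dims):
    """One DAP4 data request per dimension (the behavior without batching)."""
    return [f"{base_url}.dap?dap4.ce=/{name}" for name in dims]


def batched_url(base_url, dims):
    """A single DAP4 data request whose constraint expression lists every dimension."""
    constraint = ";".join(f"/{name}" for name in dims)
    return f"{base_url}.dap?dap4.ce={constraint}"


# Hypothetical remote file and dimension names, for illustration only.
base = "http://example.org/opendap/data.nc"
dims = ["time", "lat", "lon"]

separate = per_dimension_urls(base, dims)  # three requests
single = batched_url(base, dims)           # one request

print(len(separate), "requests vs 1 request:")
print(single)
```

With many remote files, as in an xr.open_mfdataset call, the saving is multiplied by the number of urls being opened.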
    var.dataset.enable_batch_mode()
    data_array = self._get_data_array(var)
    data = indexing.LazilyIndexedArray(data_array)
    var.dataset.disable_batch_mode()
As an aside, a with var.dataset.batch_mode(): context manager would be a nice API for this.
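A minimal sketch of what such a context manager could look like, assuming only the enable_batch_mode() and disable_batch_mode() methods seen in the diff. The `batch_mode` helper and the `FakeDataset` stand-in are hypothetical and are not part of pydap's API; the point is just that the try/finally guarantees batch mode is switched off even if the body raises.

```python
from contextlib import contextmanager


class FakeDataset:
    """Stand-in for a pydap dataset; models only the two toggles from the diff."""

    def __init__(self):
        self.batch = False

    def enable_batch_mode(self):
        self.batch = True

    def disable_batch_mode(self):
        self.batch = False


@contextmanager
def batch_mode(dataset):
    """Hypothetical helper: enable batch mode and always disable it on exit."""
    dataset.enable_batch_mode()
    try:
        yield dataset
    finally:
        dataset.disable_batch_mode()


ds = FakeDataset()
with batch_mode(ds):
    inside = ds.batch  # batch mode is on inside the block
after = ds.batch       # and off again afterwards
```

Compared with paired enable/disable calls, this form cannot leave the dataset stuck in batch mode when an exception interrupts the block.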
Thanks @dcherian. That is a really good suggestion. Currently the enable_batch... method does not support the context manager protocol (it was never meant to be turned on/off). I totally see what you mean. I'll set it up (and come back to this in a later PR).
So why are we turning it off here then?
Short answer: It doesn't need it, but is there to show it is only used for dap4 dimensions.
Long answer: the scope of this PR was originally broader, and it worked beyond dimensions:
Originally I enabled an optional parameter batch=None (default). And so while dimensions were always downloaded within a single dap response (in dap4), there was the option to download other non-dimension variables in a separate individual dap response (say when executing ds.load()). With pure pydap, there is no distinction between dims and non-dims. But xarray eagerly loads dims into memory, and so I split the logic this way.
I slimmed the PR to do only dimensions, and the performance gain when using xr.open_mfdataset is enough that I am pleased if this is merged. I'd more than gladly restore the batch=None | Iterable behavior, which further enables non-dimensions to get "batch downloaded" together, for further performance gains. The need for the optional parameter (as opposed to default for dap4) relates to best/safe practices when the remote url points to a virtually aggregated dataset (for example .ncml). In that scenario, "batch downloading" should probably be avoided, and so this behavior needs to be optional and explicitly specified by the user.
Short answer: It doesn't need it, but is there to show it is only used for dap4 dimensions.
I stand corrected - I ran some tests with ds.load() and in the current behavior, it does need to be disabled. I like the idea of using the context manager protocol. But it is not currently implemented.
Apologies - I have looked at this for so long, I am starting to get confused with the different iterations of this PR.
So this PR would either:
a) Stay as is: download only dims within single dap url.
b) Incorporate more general behavior (enable non-dims, original purpose of this PR).
I have no urgency on this, and my preference would be b) if that is OK with you. I could implement/enable the context manager protocol to improve the API and include it in pydap's next release
Happy to defer to your judgement and preferences here :)
The context manager is just a nice-to-have and not a blocker. It does sound like there's already a nice improvement. I'm happy to merge as-is at the moment
Generally I recommend merging smaller, incremental changes whenever feasible. They are easier to review and get improvements out into the world faster.
(There is a separate question of whether the pydap backend should be split out of Xarray, given its growing complexity, but I'm also happy to defer that to another day.)
Generally I recommend merging smaller, incremental changes whenever feasible.
Yeah - Let's merge as is. This is ready, and it will be nice to get something out in the world right now. It's been waaay too long...
(There is a separate question of whether the pydap backend should be split out of Xarray, given its growing complexity, but I'm also happy to defer that to another day.)
I can see the "growing complexity" argument for maintainers / developers. Definitely a question for another day. I would be happy to be part of the conversation.
…ed at once (per group)
Thanks @Mikejmnez !
With this PR, the following is true:
And so the dimensions are always batched (downloaded) together in the same response in DAP4.
In addition to this, and to preserve the previous behavior, I added a backend argument batch=True | False.
When batch=True, this makes it possible to download all non-dimension arrays in the same response (ideal when streaming data to store locally).
When batch=False, which is the default, each non-dimension array is downloaded with its own http request, as before. This is ideal in many scenarios when performing some data exploration.
When batch=False, the last step (ds.load()) triggers individual downloads.
These changes allow a more performant download experience with xarray+pydap.
However, most of these changes depend on a yet-to-be-released version of pydap (3.5.6). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code.
pydap 3.5.6 has been released!