Slowdown when using multiple NetCDF files #277

@ph-kev

I am opening multiple NetCDF files aggregated along the time dimension and reading one time slice at a time for use in a simulation.

Loading a single time slice from a multifile dataset is significantly slower than loading the same slice from a single-file dataset, even when the multifile dataset contains only one file. See the MWE below.

First, we create the NetCDF file.

using NCDatasets, DataStructures, BenchmarkTools
ds = NCDataset("test.nc","c", attrib = OrderedDict())
# Dimensions

ds.dim["time"] = 52
ds.dim["lon"] = 360
ds.dim["lat"] = 181

# Declare variables
nctime = defVar(ds,"time", Int32, ("time",), attrib = OrderedDict(
    "long_name"                 => "time",
    "standard_name"             => "time",
    "units"                     => "seconds since 1970-01-01",
    "calendar"                  => "proleptic_gregorian",
))
nclon = defVar(ds,"lon", Float32, ("lon",), attrib = OrderedDict(
    "units"                     => "degrees_east",
    "long_name"                 => "longitude",
    "standard_name"             => "longitude",
))
nclat = defVar(ds,"lat", Float32, ("lat",), attrib = OrderedDict(
    "units"                     => "degrees_north",
    "long_name"                 => "latitude",
    "standard_name"             => "latitude",
))
test_var = defVar(ds,"test", Float32, ("lon", "lat", "time"), attrib = OrderedDict(
    "units"                     => "%",
))
nctime[:] = [i for i in 1.0:52.0]
nclon[:] = [i for i in 0.0:359.0]
nclat[:] = [i for i in -90.0:90.0]
test_var[:,:,:] = zeros(360, 181, 52)

close(ds)

Then, we benchmark loading a single time slice.

using BenchmarkTools
# Single-file dataset
ds = NCDataset("test.nc","r")
@benchmark $ds["test"][:,:,1]
close(ds)
# Multifile dataset (one file, aggregated along "time")
ds = NCDataset(["test.nc"],"r", aggdim = "time")
@benchmark $ds["test"][:,:,1]
close(ds)

On my machine, the median time is 59.002 μs for the single-file dataset and 1.407 ms for the multifile dataset, roughly 24 times slower.

Is there any way to speed up loading a time slice from a multifile dataset so that it matches the performance of a single-file dataset? I am particularly interested in the case where the requested time slice lies entirely within one of the files.
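For reference, the kind of workaround I could fall back on is sketched below: open each file once, record the cumulative number of time steps, and map a global time index to the file that holds it. The helper names `open_slices` and `read_slice` are hypothetical and not part of NCDatasets. The sketch assumes the variable has three dimensions with the aggregation dimension last, and I have not benchmarked whether it actually recovers single-file performance.

```julia
using NCDatasets

# Hypothetical helper (not part of NCDatasets): open every file once and
# return the datasets together with the cumulative number of time steps,
# i.e. the global index of the last time step stored in each file.
function open_slices(fnames; aggdim = "time")
    datasets = [NCDataset(f, "r") for f in fnames]
    counts = [ds.dim[aggdim] for ds in datasets]
    return datasets, cumsum(counts)
end

# Hypothetical helper: read global time slice `n` of variable `varname`
# directly from the one file that contains it.
# Assumes a 3D variable whose last dimension is the aggregation dimension.
function read_slice(datasets, ends, varname, n)
    n >= 1 || throw(BoundsError(ends, n))
    k = searchsortedfirst(ends, n)      # first file whose last global index is >= n
    k > length(ends) && throw(BoundsError(ends, n))
    start = k == 1 ? 0 : ends[k - 1]    # number of time steps in earlier files
    return datasets[k][varname][:, :, n - start]
end
```

Usage would look like `datasets, ends = open_slices(["test.nc"])` followed by `read_slice(datasets, ends, "test", 1)`, with `foreach(close, datasets)` at the end of the simulation. This gives up the convenience of the aggregated view, so a way to get comparable speed from the multifile dataset itself would be preferable.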
