Intro to Zarr & Xarray

Zarr is a file format for storing multi-dimensional arrays that is optimized for large datasets in cloud computing environments. It is similar in concept to Cloud Optimized GeoTIFFs and Cloud Optimized Point Clouds, but for large multi-dimensional arrays. When a Zarr dataset is stored in cloud object storage, portions of it (known as chunks) can be streamed into client applications such as a Jupyter notebook for analysis. This lets us work with datasets without downloading them in full to our local computer.
It is a chunked, hierarchical format that supports compression and versioning. Additionally, it is portable and can be shared among different software applications.
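To make the idea of chunked access concrete, the sketch below works out which chunks a reader would need to fetch for a requested region of a 2D array. It is plain Python and does not use the Zarr library; the function name `chunks_for_slice` and the array and chunk sizes are illustrative assumptions, not part of any Zarr API.

```python
from itertools import product


def chunks_for_slice(starts, stops, chunk_shape):
    """Return the grid indices of every chunk overlapping a half-open region.

    starts, stops: per-dimension bounds of the requested region (stop exclusive).
    chunk_shape:   per-dimension chunk size.
    """
    per_dim = []
    for start, stop, size in zip(starts, stops, chunk_shape):
        first = start // size       # chunk holding the first requested element
        last = (stop - 1) // size   # chunk holding the last requested element
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


# Hypothetical 10000 x 10000 array stored in 1000 x 1000 chunks (100 chunks in total).
needed = chunks_for_slice(starts=(0, 2500), stops=(500, 3500), chunk_shape=(1000, 1000))
print(needed)  # [(0, 2), (0, 3)]
```

In this hypothetical case only 2 of the 100 chunks overlap the requested region, which is why a client can analyze part of a large array without transferring all of it.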
Zarr is an open-source format that stores compressed N-dimensional arrays in chunks. It is designed for use in parallel computing, and tools such as GDAL can read it through either a traditional 2D raster API or a newer multidimensional API. Like NetCDF-4, Zarr captures both data and metadata, and unlike a tabular format such as Parquet it allows chunking along any dimension. Zarr arrays behave like NumPy arrays, but the data is divided into chunks and each chunk is compressed. Zarr provides functionality similar to HDF5 with additional flexibility: it offers a range of compression options, supports multiple storage backends, and can read and write NumPy arrays in parallel.
Here are some of the benefits of using Zarr:
- Efficient for large datasets
- Well-suited for cloud computing environments
- Chunked format
- Compressed format
- Hierarchical format
- Versioning support
- Portable format


Xarray is an open-source Python project that builds on NumPy and Pandas to handle the multidimensional data structures used in the physical sciences. Xarray attaches labels, in the form of dimensions, coordinates, and attributes, to multidimensional arrays.
Xarray expands on the capabilities of NumPy arrays and streamlines many common data manipulation tasks. In that respect it is similar to Pandas, but whereas Pandas excels at working with dataframes, Xarray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java’s Common Data Model (CDM).
Xarray has two core data structures:
- DataArray, which is an N-dimensional array with labeled coordinates and dimensions. It is an N-dimensional generalization of a pandas.Series.
- Dataset, which is a multidimensional in-memory array database.
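
The relationship between the two structures can be sketched as follows. This is a minimal example assuming `xarray` and `numpy` are installed; the variable names (`temperature`, `pressure`), dimension names, and values are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical values on a shared (time, lat) grid.
ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat"), np.array([[15.0, 16.0, 17.0], [14.0, 15.5, 18.0]])),
        "pressure": (("time", "lat"), np.full((2, 3), 1013.0)),
    },
    coords={"time": [0, 1], "lat": [10.0, 20.0, 30.0]},
)

# Each variable pulled out of the Dataset is itself a DataArray.
temperature = ds["temperature"]

print(list(ds.data_vars))
print(dict(ds.sizes))
print(type(temperature).__name__)
```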
xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:
- values: a numpy.ndarray holding the array’s values
- dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
- coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: dict to hold arbitrary metadata (attributes)
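
The four properties listed above can be inspected on a small DataArray. This is a sketch assuming `xarray` and `numpy` are installed; the dimension names, coordinate labels, and the `units` attribute are made up for illustration.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
    attrs={"units": "metres"},
)

print(type(da.values))   # the underlying numpy.ndarray
print(da.dims)           # dimension names, one per axis
print(list(da.coords))   # names of the coordinate arrays
print(da.attrs)          # arbitrary user metadata
```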
Xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many NumPy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.
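
A short sketch of these operations, assuming `xarray` and `numpy` are installed. The dimension names (`time`, `station`) and all values are hypothetical.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    dims=("time", "station"),
    coords={"time": [2000, 2001], "station": ["A", "B", "C"]},
)

# Reduce over a named dimension instead of a positional axis
# (equivalent here to da.values.mean(axis=0)).
mean_over_time = da.mean(dim="time")

# Select by coordinate label rather than by integer position.
value = da.sel(time=2001, station="B").item()

# Arithmetic aligns operands on coordinate labels, not on array position.
# Only stations present in both operands (A and C) appear in the result.
other = xr.DataArray([10.0, 20.0], dims="station", coords={"station": ["C", "A"]})
total = da + other

print(mean_over_time.values)
print(value)
print(total.sel(station="A").values)
```

Note that `other` lists its stations in a different order and omits station B; the addition still pairs values by label, which is the alignment behavior the paragraph above describes.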
DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property. Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases.
Installing Xarray can also bring in a set of optional dependencies, for netCDF and other I/O as well as for the purposes below.
For accelerating Xarray:
- scipy for enabling interpolation features for xarray objects.
- bottleneck for fast NumPy array functions used by xarray.
For parallel computing:
- dask.array for parallel computing in Python.
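
As a rough sketch of how the dask integration is typically used, the example below splits a DataArray into dask chunks and evaluates a computation lazily. It assumes `xarray`, `numpy`, and `dask` are installed; the array contents and chunk size are arbitrary.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=("y", "x"))

# Split the "x" dimension into chunks of 2; the data becomes a lazy dask array.
lazy = da.chunk({"x": 2})

# Operations build a task graph; nothing is computed yet.
result = (lazy * 2).sum("x")

# compute() runs the graph and returns a DataArray backed by NumPy.
computed = result.compute()

print(lazy.chunks)
print(computed.values)
```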