Intro to Zarr & Xarray

Zarr

Zarr is a file format for storing multi-dimensional arrays that is optimized for large datasets in cloud computing environments. It is similar in concept to Cloud Optimized GeoTIFFs and Cloud Optimized Point Clouds, but for large multi-dimensional arrays. When a Zarr store is kept in cloud object storage, portions of it (known as chunks) can be streamed into client applications such as a Jupyter notebook for analysis. This allows us to work with datasets without having to download them to our local computer.

It is a chunked, hierarchical format that supports compression and versioning. Additionally, it is portable and can be shared among different software applications.

Zarr is an open-source format that stores compressed N-dimensional arrays in chunks. It is designed for use in parallel computing, and tools such as GDAL can read it through either the traditional 2D raster API or the newer multidimensional API. Like NetCDF-4, Zarr captures and expresses both metadata and data, and it is more flexible than Parquet in that it allows chunking along any dimension. Zarr arrays function like NumPy arrays, but the data is divided into chunks and compressed. Zarr provides functionality similar to HDF5 but with additional flexibility: it offers powerful compression options, supports multiple data store backends, and can read and write NumPy arrays in parallel.

Here are some of the benefits of using Zarr:

Efficient for large datasets
Well-suited for cloud computing environments
Chunked format
Compressed format
Hierarchical format
Versioning support
Portable format

Zarr Tutorial

Introduction to Xarray

Xarray is an open-source Python project that builds on NumPy and Pandas to handle the multidimensional data structures used in the physical sciences. Xarray adds labels, in the form of dimensions, coordinates and attributes, to multidimensional arrays.

Introducing the DataArray and Dataset

Xarray expands on the capabilities of NumPy arrays, streamlining many common data manipulation tasks. It is similar in that respect to Pandas, but whereas Pandas excels at working with dataframes, Xarray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java’s Common Data Model (CDM).

Xarray has two core data structures:

DataArray, which is an N-dimensional array with labeled coordinates and dimensions. It is an N-dimensional generalization of a pandas.Series.
Dataset, which is a multidimensional in-memory array database. It is a dict-like container of DataArray objects that share dimensions.
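
To make the relationship between the two structures concrete, here is a small sketch that builds a Dataset from two variables on the same grid and pulls one of them out as a DataArray. The variable names, station labels, and values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Two hypothetical variables on the same 2 x 3 (time, station) grid.
temperature = np.array([[11.0, 12.5, 13.1], [10.2, 11.9, 12.7]])
pressure = np.array([[1012.0, 1009.5, 1011.2], [1013.4, 1010.1, 1012.0]])

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "station"), temperature),
        "pressure": (("time", "station"), pressure),
    },
    coords={
        "time": ["2024-01-01", "2024-01-02"],
        "station": ["A", "B", "C"],
    },
)

# Each variable in the Dataset is a DataArray sharing the same coordinates.
temp = ds["temperature"]
print(type(temp).__name__, temp.dims)
print(list(ds.data_vars))
```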

Data structures

DataArray

xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:

values: a numpy.ndarray holding the array’s values
dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
attrs: dict to hold arbitrary metadata (attributes)
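
The four properties listed above can be inspected directly on a small DataArray. This is a minimal sketch; the dimension names, coordinate labels, and the "units" attribute are arbitrary choices for the example.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),                    # values
    dims=("x", "y"),                               # one name per axis
    coords={"x": [10, 20], "y": ["a", "b", "c"]},  # labels along each dimension
    attrs={"units": "m"},                          # free-form metadata
)

print(type(da.values).__name__)  # the underlying NumPy array type
print(da.dims)
print(list(da.coords))
print(da.attrs)
```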

Xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.
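
A short sketch of the three ideas in the paragraph above: reducing by dimension name, selecting by label, and label-based alignment in arithmetic. The data and labels are invented for the example, and the alignment shown relies on xarray's default behaviour of keeping only the labels both operands share.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    dims=("time", "station"),
    coords={"time": [0, 1], "station": ["A", "B", "C"]},
)

# Reduce by dimension name rather than by axis number
# (equivalent to da.values.mean(axis=0) here).
mean_over_time = da.mean(dim="time")

# Select by coordinate label rather than by integer position.
station_b = da.sel(station="B")

# Arithmetic aligns on coordinate labels: only the shared stations remain.
other = xr.DataArray([10.0, 20.0], dims="station", coords={"station": ["B", "C"]})
total = da + other

print(mean_over_time.values)
print(station_b.values)
print(total["station"].values)
```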

DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property. Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases.

Xarray installation and dependencies.

Xarray has a set of optional dependencies that can be installed alongside it.

For netCDF and IO:

netCDF4, for reading/writing netCDF files with Xarray.
scipy, for reading/writing netCDF3 files.

For accelerating Xarray:

scipy, for enabling interpolation features for xarray objects.
bottleneck, which provides fast NumPy array functions used by xarray.

For parallel computing:

dask.array, for parallel computing in Python.

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.
