Zarr (data format)

Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready": data is divided into subsets called chunks, which allows random access to parts of an array. Zarr can be used from many programming languages, including Python, Java, JavaScript, C++, Rust and Julia. It has been used by organizations such as Google and Microsoft to publish large datasets. The first versions of Zarr were released in 2015 by Alistair Miles.

Zarr is designed to support high-throughput distributed I/O on different storage systems, which is a common requirement in cloud computing. Multiple read operations, or multiple write operations, can proceed efficiently in parallel on the same Zarr array.
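The following Python sketch illustrates why chunked storage lends itself to parallel I/O: when each chunk lives under its own key, several workers can write or read different chunks without coordinating with each other. It is not Zarr's API. The in-memory dict standing in for a storage system, the function names, the tiny chunk length and the use of zlib as the compressor are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import struct
import zlib

CHUNK_LEN = 4  # elements per chunk (hypothetical, tiny for illustration)


def encode_chunk(values):
    """Pack floats as little-endian doubles, then compress the bytes."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(raw)


def decode_chunk(blob):
    """Inverse of encode_chunk: decompress, then unpack the doubles."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


def write_chunk(store, chunk_index, values):
    # Each chunk has its own key, so writers never touch the same entry.
    store[str(chunk_index)] = encode_chunk(values)


def read_chunk(store, chunk_index):
    return decode_chunk(store[str(chunk_index)])


data = [float(i) for i in range(16)]
store = {}  # stand-in for a key-value storage system (assumption)

# Write the four chunks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(write_chunk, store, k, data[k * CHUNK_LEN:(k + 1) * CHUNK_LEN])
        for k in range(4)
    ]
    for future in futures:
        future.result()  # re-raise any error from a worker

# Read the four chunks back in parallel and stitch them together.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(lambda k: read_chunk(store, k), range(4)))
restored = [value for chunk in chunks for value in chunk]
```

In this sketch the writers are independent only because no two of them target the same chunk; writes that overlap within one chunk would need some form of coordination.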

Format description

The main data structure in Zarr is the multidimensional array. To allow parallel access, each array is stored and accessed as a grid of "chunks". The actual data format on disk depends on the compressor and storage plugins selected by the user.
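The arithmetic behind a regular chunk grid can be sketched in a few lines of Python. The example assumes fixed-size chunks and uses hypothetical helper names and array sizes; it shows how an element index maps to a chunk and which chunks a rectangular region touches, which is what makes partial (random) access possible.

```python
import math
from itertools import product


def chunk_grid_shape(array_shape, chunk_shape):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))


def locate(index, chunk_shape):
    """Map an element index to (chunk coordinates, offset inside that chunk)."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk_coords, offset


def chunks_for_region(start, stop, chunk_shape):
    """Chunk coordinates touched by the half-open region [start, stop)."""
    ranges = [
        range(s // c, (e - 1) // c + 1)
        for s, e, c in zip(start, stop, chunk_shape)
    ]
    return list(product(*ranges))


array_shape = (10000, 10000)  # hypothetical 2-D array
chunk_shape = (1000, 1000)    # hypothetical chunk size

grid = chunk_grid_shape(array_shape, chunk_shape)
coords, offset = locate((2500, 7125), chunk_shape)
touched = chunks_for_region((900, 0), (1100, 1500), chunk_shape)
```

Here the array is covered by a 10 x 10 grid, the element at (2500, 7125) falls in chunk (2, 7), and the requested region spans only four of the hundred chunks, so a reader would need to fetch and decompress just those four.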

An illustration of Zarr's chunking data format.

Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array.
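As a rough illustration of named hierarchies and key-value annotations, the Python sketch below keeps JSON metadata documents in a flat key-value store, with slash-separated paths expressing the hierarchy. The key names `group.json` and `array.json`, the metadata fields and the helper functions are invented for this example and are not the names defined by the Zarr specification.

```python
import json


def create_group(store, path, attrs=None):
    """Record a group at `path` ("" is the root) with optional attributes."""
    key = f"{path}/group.json".lstrip("/")
    store[key] = json.dumps({"attributes": attrs or {}})


def create_array(store, path, shape, chunks, dtype, attrs=None):
    """Record array metadata at `path`; chunk data would sit alongside it."""
    meta = {
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": dtype,
        "attributes": attrs or {},
    }
    store[f"{path}/array.json"] = json.dumps(meta)


def list_children(store, group_path):
    """Names of the groups and arrays directly inside `group_path`."""
    prefix = group_path + "/" if group_path else ""
    names = set()
    for key in store:
        if key.startswith(prefix):
            head, sep, _ = key[len(prefix):].partition("/")
            if sep:  # skip the group's own metadata document
                names.add(head)
    return sorted(names)


store = {}  # stand-in for a directory or object store (assumption)
create_group(store, "", {"title": "demo hierarchy"})
create_group(store, "climate")
create_array(
    store, "climate/temperature",
    shape=(365, 180, 360), chunks=(30, 90, 90), dtype="float32",
    attrs={"units": "kelvin"},
)

root_children = list_children(store, "")
climate_children = list_children(store, "climate")
temperature_meta = json.loads(store["climate/temperature/array.json"])
```

Because the metadata is plain JSON stored next to the data, a reader can discover the layout and annotations of an array without loading any of its chunks.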

Applications

For bioimaging such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions. Zarr is similarly used to publish weather, satellite and energy data, among other kinds of data.

Uses material from the Wikipedia article Zarr (data format), released under the CC BY-SA 4.0 license.