---
status: w3c/CG-FINAL
level: 1
date: 2023-05-25
version: 0.4
title: Next-generation file formats (NGFF)
abstract: >-
  This document contains next-generation file format (NGFF)
  specifications for storing bioimaging data in the cloud.
  All specifications are submitted to the https://image.sc community for review.
authors:
  - name: "Editor: Josh Moore"
    roles: Editor
    affiliations:
      - name: University of Dundee (UoD)
        ror: https://ror.org/03h2bxq36
        url: https://www.dundee.ac.uk
  - name: "Editor: Sébastien Besson"
    roles: Editor
    orcid: https://orcid.org/0000-0001-8783-1429
    affiliations:
      - name: University of Dundee (UoD)
        ror: https://ror.org/03h2bxq36
        url: https://www.dundee.ac.uk
  - name: "Editor: Constantin Pape"
    roles: Editor
    orcid: 0000-0001-6562-7187
    affiliations:
      - name: European Molecular Biology Laboratory (EMBL)
        url: https://www.embl.org/sites/heidelberg/
        ror: https://ror.org/03mstc592
---

# Version 0.4

## Introduction
(version0.4:intro)=

Bioimaging science is at a crossroads. Currently, the drive to acquire more, larger, more precise spatial measurements is unfortunately at odds with our ability to structure and share those measurements with others. During a global pandemic more than ever, we believe fervently that global, collaborative discovery, as opposed to the post-publication, "data-on-request" mode of operation, is the path forward. Bioimaging data should be shareable via open and commercial cloud resources without the need to download entire datasets.

At the moment, that is not the norm. The plethora of data formats produced by imaging systems are ill-suited to remote sharing. Individual scientists typically lack the infrastructure they need to host these data themselves. When they acquire images from elsewhere, time-consuming translations and data cleaning are needed to interpret findings. Those same costs are multiplied when gathering data into online repositories where curator time can be the limiting factor before publication is possible.
Without a common effort, each lab or resource is left building the tools they need and maintaining that infrastructure, often without dedicated funding.

This document defines a specification for bioimaging data that enables the conversion of proprietary formats into a common, cloud-ready one. Such next-generation file formats lay out data so that individual portions, or "chunks", of large data are referenceable, eliminating the need to download entire datasets.

### Why "NGFF"?
(version0.4:why-ngff)=

A short description of what is needed for an imaging format is "a hierarchy of n-dimensional (dense) arrays with metadata". This combination of features is certainly provided by HDF5 from the HDF Group, which a number of bioimaging formats do use. HDF5 and other larger binary structures, however, are ill-suited for storage in the cloud where accessing individual chunks of data by name rather than seeking through a large file is at the heart of parallelization.

As a result, a number of formats have been developed more recently which provide the basic data structure of an HDF5 file, but do so in a more cloud-friendly way. In the [PyData](https://pydata.org/) community, the Zarr ({cite:t}`zarr`) format was developed for easily storing collections of [NumPy](https://numpy.org/) arrays. In the [ImageJ](https://imagej.net/) community, N5 ({cite:t}`n5`) was developed to work around the limitations of HDF5 ("N5" was originally short for "Not-HDF5"). Both of these formats permit storing individual chunks of data either locally in separate files or in cloud-based object stores as separate keys.

A [current effort](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html) is underway to unify the two similar specifications to provide a single binary specification.
The editor's draft will soon be entering a [request for comments (RFC)](https://github.com/zarr-developers/zarr-specs/issues/101) phase with the goal of having a first version early in 2021. As that process comes to an end, this document will be updated.

### OME-NGFF
(version0.4:ome-ngff)=

The conventions and specifications defined in this document are designed to enable next-generation file formats to represent the same bioimaging data that can be represented in [OME-TIFF](http://www.openmicroscopy.org/ome-files/) and beyond. However, the conventions will also be usable by HDF5 and other sufficiently advanced binary containers. Eventually, we hope, the moniker "next-generation" will no longer be applicable, and this will simply be the most efficient, common, and useful representation of bioimaging data, whether during acquisition or sharing in the cloud.

Note: The following text makes use of OME-Zarr ({cite:t}`ome-zarr-py`), the current prototype implementation, for all examples.

### Document conventions

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” are to be interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119).

Transitional metadata is added to the specification with the intention of removing it in the future. Implementations may be expected (MUST) or encouraged (SHOULD) to support the reading of the data, but writing will usually be optional (MAY). Examples of transitional metadata include custom additions by implementations that are later submitted as a formal specification. (See [bioformats2raw metadata](#version0.4:bf2raw).)

Some of the JSON examples in this document include comments. However, these are only for clarity purposes and comments MUST NOT be included in JSON objects.

## On-disk (or in-cloud) layout
(version0.4:on-disk)=

An overview of the layout of an OME-Zarr fileset should make understanding the following metadata sections easier. The hierarchy is represented here as it would appear locally but could equally be stored on a web server to be accessed via HTTP or in object storage like S3 or GCS.

OME-Zarr is an implementation of the OME-NGFF specification using the Zarr format. Arrays MUST be defined and stored in a hierarchical organization as defined by [version 2 of the Zarr specification](https://zarr.readthedocs.io/en/stable/spec/v2.html). OME-NGFF metadata MUST be stored as attributes in the corresponding Zarr groups.

### Images
(version0.4:image-layout)=

The following layout describes the expected Zarr hierarchy for images with multiple levels of resolutions and optionally associated labels. Note that the number of dimensions is variable between 2 and 5 and that axis names are arbitrary; see [multiscales metadata](#version0.4:multiscale-md) for details. For this example we assume an image with 5 dimensions and axes called `t,c,z,y,x`.

.                             # Root folder, potentially in S3,
│                             # with a flat list of images by image ID.
│
├── 123.zarr                  # One image (id=123) converted to Zarr.
│
└── 456.zarr                  # Another image (id=456) converted to Zarr.
    │
    ├── .zgroup               # Each image is a Zarr group, or a folder, of other groups and arrays.
    ├── .zattrs               # Group level attributes are stored in the .zattrs file and include
    │                         # "multiscales" and "omero" (see below). In addition, the group level attributes
    │                         # may also contain "_ARRAY_DIMENSIONS" for compatibility with xarray if this group directly contains multi-scale arrays.
    │
    ├── 0                     # Each multiscale level is stored as a separate Zarr array,
    │   ...                   # which is a folder containing chunk files which compose the array.
    ├── n                     # The name of the array is arbitrary with the ordering defined by
    │   │                     # the "multiscales" metadata, but is often a sequence starting at 0.
    │   │
    │   ├── .zarray           # All image arrays must be up to 5-dimensional
    │   │                     # with the axis of type time before type channel, before spatial axes.
    │   │
    │   └─ t                  # Chunks are stored with the nested directory layout.
    │      └─ c               # All but the last chunk element are stored as directories.
    │         └─ z            # The terminal chunk is a file. Together the directory and file names
    │            └─ y         # provide the "chunk coordinate" (t, c, z, y, x), where the maximum coordinate
    │               └─ x      # will be `dimension_size / chunk_size`.
    │
    └── labels
        │
        ├── .zgroup           # The labels group is a container which holds a list of labels to make the objects easily discoverable
        │
        ├── .zattrs           # All labels will be listed in `.zattrs` e.g. `{ "labels": [ "original/0" ] }`
        │                     # Each dimension of the label `(t, c, z, y, x)` should be either the same as the
        │                     # corresponding dimension of the image, or `1` if that dimension of the label
        │                     # is irrelevant.
        │
        └── original          # Intermediate folders are permitted but not necessary and currently contain no extra metadata.
            │
            └── 0             # Multiscale, labeled image. The name is unimportant but is registered in the "labels" group above.
                ├── .zgroup   # Zarr Group which is both a multiscaled image as well as a labeled image.
                ├── .zattrs   # Metadata of the related image as well as display information under the "image-label" key.
                │
                ├── 0         # Each multiscale level is stored as a separate Zarr array, as above, but only integer values
                │   ...       # are supported.
                └── n
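
The nested chunk layout described in the tree above can be sketched in Python. This is an illustration only, not part of the specification: the function names `chunk_path` and `chunk_grid`, the array shape and the chunk shape are all hypothetical. It assumes that chunk coordinates are joined with `/`, as the nested directory layout implies, and that the number of chunks along an axis is the dimension size divided by the chunk size, rounded up when the final chunk is only partly filled.

```python
import math


def chunk_path(index, chunk_shape):
    """Return the nested-directory key of the chunk holding one element.

    `index` is an element coordinate such as (t, c, z, y, x) and
    `chunk_shape` is the chunk size along each of those axes.
    """
    if len(index) != len(chunk_shape):
        raise ValueError("index and chunk_shape must have the same length")
    # Integer division maps an element coordinate to its chunk coordinate.
    chunk_coord = [i // c for i, c in zip(index, chunk_shape)]
    # Every chunk coordinate becomes a path component: all but the last
    # are directories, the last one is the chunk file itself.
    return "/".join(str(c) for c in chunk_coord)


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each axis (the last chunk may be partial)."""
    return [math.ceil(s / c) for s, c in zip(shape, chunk_shape)]


# Hypothetical 5D array (t, c, z, y, x) and chunk shape, for illustration only.
shape = (10, 2, 50, 2048, 2048)
chunks = (1, 1, 1, 512, 512)

print(chunk_path((3, 1, 20, 1000, 600), chunks))  # 3/1/20/1/1
print(chunk_grid(shape, chunks))                  # [10, 2, 50, 4, 4]
```

Because each chunk has its own key, a reader that needs one plane only has to fetch the chunk files whose coordinates cover it.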
### High-content screening
(version0.4:hcs-layout)=
The following specification defines the hierarchy for a high-content screening
dataset. Three groups MUST be defined above the images:

- the group above the images defines the well and MUST implement the
  [well specification](#version0.4:well-md). All images contained in a well are fields
  of view of the same well
- the group above the well defines a row of wells
- the group above the well row defines an entire plate, i.e. a two-dimensional
  collection of wells organized in rows and columns. It MUST implement the
  [plate specification](#version0.4:plate-md)

A well row group SHOULD NOT be present if there are no images in the well row.
A well group SHOULD NOT be present if there are no images in the well.

.                             # Root folder, potentially in S3,
│
└── 5966.zarr                 # One plate (id=5966) converted to Zarr
    ├── .zgroup
    ├── .zattrs               # Implements "plate" specification
    ├── A                     # First row of the plate
    │   ├── .zgroup
    │   │
    │   ├── 1                 # First column of row A
    │   │   ├── .zgroup
    │   │   ├── .zattrs       # Implements "well" specification
    │   │   │
    │   │   ├── 0             # First field of view of well A1
    │   │   │   │
    │   │   │   ├── .zgroup
    │   │   │   ├── .zattrs   # Implements "multiscales", "omero"
    │   │   │   ├── 0
    │   │   │   │   ...       # Resolution levels
    │   │   │   ├── n
    │   │   │   └── labels    # Labels (optional)
    │   │   ├── ...           # Fields of view
    │   │   └── m
    │   ├── ...               # Columns
    │   └── 12
    ├── ...                   # Rows
    └── H
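
The plate, row, well and field nesting shown above maps directly onto relative paths. The following Python sketch is illustrative only: `field_path` and `well_paths` are hypothetical helper names, and the 8 x 12 plate simply mirrors the row names A to H and column names 1 to 12 used in the example tree. Since groups for empty rows and wells SHOULD NOT be present, a real plate may contain only a subset of these paths.

```python
def field_path(row, column, field):
    """Path of one field of view relative to the plate group: row/column/field."""
    return f"{row}/{column}/{field}"


def well_paths(rows, columns):
    """Relative paths of every possible well group, row by row."""
    return [f"{r}/{c}" for r in rows for c in columns]


# Hypothetical 8 x 12 plate matching the row and column names in the tree above.
rows = [chr(ord("A") + i) for i in range(8)]
columns = [str(i) for i in range(1, 13)]

wells = well_paths(rows, columns)
print(len(wells))               # 96
print(wells[0], wells[-1])      # A/1 H/12
print(field_path("A", "1", 0))  # A/1/0
```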
## Metadata
(version0.4:metadata)=
The various `.zattrs` files throughout the above array hierarchy may contain metadata
keys as specified below for discovering certain types of data, especially images.
### "axes" metadata
(version0.4:axes-md)=

"axes" describes the dimensions of a physical coordinate space. It is a list of dictionaries, where each dictionary describes a dimension (axis) and:

- MUST contain the field "name" that gives the name for this dimension. The values MUST be unique across all "name" fields.
- SHOULD contain the field "type". It SHOULD be one of "space", "time" or "channel", but MAY take other string values for custom axis types that are not part of this specification yet.
- SHOULD contain the field "unit" to specify the physical unit of this dimension. The value SHOULD be one of the following strings, which are valid units according to UDUNITS-2.
  - Units for "space" axes: 'angstrom', 'attometer', 'centimeter', 'decimeter', 'exameter', 'femtometer', 'foot', 'gigameter', 'hectometer', 'inch', 'kilometer', 'megameter', 'meter', 'micrometer', 'mile', 'millimeter', 'nanometer', 'parsec', 'petameter', 'picometer', 'terameter', 'yard', 'yoctometer', 'yottameter', 'zeptometer', 'zettameter'
  - Units for "time" axes: 'attosecond', 'centisecond', 'day', 'decisecond', 'exasecond', 'femtosecond', 'gigasecond', 'hectosecond', 'hour', 'kilosecond', 'megasecond', 'microsecond', 'millisecond', 'minute', 'nanosecond', 'petasecond', 'picosecond', 'second', 'terasecond', 'yoctosecond', 'yottasecond', 'zeptosecond', 'zettasecond'

If part of [multiscales metadata](#version0.4:multiscale-md), the length of "axes" MUST be equal to the number of dimensions of the arrays that contain the image data.
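
The rules above can be checked mechanically. The Python sketch below is a minimal illustration, not a reference validator: `check_axes` and `KNOWN_TYPES` are hypothetical names, MUST rules are reported as errors and SHOULD rules as warnings, and unit strings are not compared against the lists above. As a further assumption, a missing "unit" is only flagged for "space" and "time" axes, because those are the only axis types for which units are listed.

```python
KNOWN_TYPES = {"space", "time", "channel"}


def check_axes(axes, ndim=None):
    """Return (errors, warnings) for an "axes" list.

    Errors correspond to MUST rules, warnings to SHOULD rules.
    """
    errors, warnings = [], []
    names = [axis.get("name") for axis in axes]
    if any(name is None for name in names):
        errors.append('every axis MUST contain "name"')
    if len(set(names)) != len(names):
        errors.append('"name" values MUST be unique')
    if ndim is not None and len(axes) != ndim:
        errors.append('length of "axes" MUST equal the array dimensionality')
    for axis in axes:
        label = axis.get("name", "<unnamed>")
        if "type" not in axis:
            warnings.append(f'{label}: SHOULD contain "type"')
        elif axis["type"] not in KNOWN_TYPES:
            warnings.append(f'{label}: custom axis type {axis["type"]!r}')
        # Assumption: only space and time axes are expected to carry a unit.
        if axis.get("type") in ("space", "time") and "unit" not in axis:
            warnings.append(f'{label}: SHOULD contain "unit"')
    return errors, warnings


# Example "axes" value for a 5D image; the units are chosen for illustration.
axes = [
    {"name": "t", "type": "time", "unit": "millisecond"},
    {"name": "c", "type": "channel"},
    {"name": "z", "type": "space", "unit": "micrometer"},
    {"name": "y", "type": "space", "unit": "micrometer"},
    {"name": "x", "type": "space", "unit": "micrometer"},
]
print(check_axes(axes, ndim=5))  # ([], [])
```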
### "bioformats2raw.layout" (transitional)
(version0.4:bf2raw)=
Transitional "bioformats2raw.layout" metadata identifies a group which implicitly describes a series of images.
The need for the collection stems from the common "multi-image file" scenario in microscopy. Parsers like Bio-Formats
define a strict, stable ordering of the images in a single container that can be used to refer to them by other tools.
In order to capture that information within an OME-NGFF dataset, `bioformats2raw` internally introduced a wrapping layer.
The bioformats2raw layout has been added to v0.4 as a transitional specification to specify filesets that already exist
in the wild. An upcoming NGFF specification will replace this layout with explicit metadata.
#### Layout
(version0.4:bf2raw-layout)=
Typical Zarr layout produced by running `bioformats2raw` on a fileset that contains more than one image (series > 1):

series.ome.zarr               # One converted fileset from bioformats2raw
├── .zgroup
├── .zattrs                   # Contains "bioformats2raw.layout" metadata
├── OME                       # Special group for containing OME metadata
│   ├── .zgroup
│   ├── .zattrs               # Contains "series" metadata
│   └── METADATA.ome.xml      # OME-XML file stored within the Zarr fileset
├── 0                         # First image in the collection
├── 1                         # Second image in the collection
└── ...
#### Attributes
(version0.4:bf2raw-attributes)=
The top-level `.zattrs` file must contain the `bioformats2raw.layout` key:
```{literalinclude} examples/bf2raw/image.json
:language: json
```
If the top-level group represents a plate, the `bioformats2raw.layout` metadata will be present, but
the "plate" key MUST also be present and takes precedence; parsing of such datasets should follow [plate metadata](#version0.4:plate-md). It is not
possible to mix collections of images with plates at present.
```{literalinclude} examples/bf2raw/plate.json
:language: json
```
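
The precedence rule for plates can be sketched as a small Python check. This is a hedged illustration rather than the behaviour of any particular reader: `classify_group` is a hypothetical helper, the attribute dictionaries are made up for the example, and the sketch assumes the layout version is stored as the JSON number 3.

```python
import json


def classify_group(zattrs):
    """Classify a top-level group from its parsed `.zattrs` content."""
    if "plate" in zattrs:
        # "plate" takes precedence over "bioformats2raw.layout".
        return "plate"
    # Assumption: the layout version is stored as the JSON number 3.
    if zattrs.get("bioformats2raw.layout") == 3:
        return "image collection"
    return "other"


# Hypothetical attribute contents, for illustration only.
image_attrs = json.loads('{"bioformats2raw.layout": 3}')
plate_attrs = json.loads('{"bioformats2raw.layout": 3, "plate": {}}')

print(classify_group(image_attrs))  # image collection
print(classify_group(plate_attrs))  # plate
print(classify_group({}))           # other
```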
The `.zattrs` file within the OME group may contain the "series" key:
Example: `examples/ome/series-2.json` (JSON)

#### Details
(version0.4:bf2raw-details)=

Conforming groups:

- MUST have the value "3" for the "bioformats2raw.layout" key in their `.zattrs` metadata at the top of the hierarchy;
- SHOULD have OME metadata representing the entire collection of images in a file named "OME/METADATA.ome.xml" which:
  - MUST adhere to the OME-XML specification but
  - MUST use `

| type | fields | description |
|---|---|---|
| `identity` | | identity transformation, is the default transformation and is typically not explicitly defined |
| `translation` | one of: `"translation":List[float]`, `"path":str` | translation vector, stored either as a list of floats (`"translation"`) or as binary data at a location in this container (`path`). The length of vector defines number of dimensions. |
| `scale` | one of: `"scale":List[float]`, `"path":str` | scale vector, stored either as a list of floats (`"scale"`) or as binary data at a location in this container (`path`). The length of vector defines number of dimensions. |
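
The three transformation types in the table above can be illustrated with a short Python sketch. It rests on several assumptions that the table does not state: each transformation is a dictionary with a "type" key, the vector is given inline as a list of floats (the `path` variant is not handled), and a list of transformations is applied in order. `apply_transformations` is a hypothetical name.

```python
def apply_transformations(point, transformations):
    """Apply identity, scale and translation entries to one coordinate.

    Assumption: entries are applied in list order and carry their vector
    inline; binary vectors referenced through "path" are not handled.
    """
    coords = list(point)
    for t in transformations:
        kind = t["type"]
        if kind == "identity":
            continue  # identity leaves the coordinate unchanged
        if kind not in ("scale", "translation"):
            raise ValueError(f"unsupported transformation type: {kind!r}")
        if kind not in t:
            raise NotImplementedError('the "path" variant is not handled here')
        vector = t[kind]
        # The length of the vector defines the number of dimensions.
        if len(vector) != len(coords):
            raise ValueError("vector length must match the number of dimensions")
        if kind == "scale":
            coords = [c * v for c, v in zip(coords, vector)]
        else:
            coords = [c + v for c, v in zip(coords, vector)]
    return coords


# Hypothetical 3D example: scale first, then translate.
transformations = [
    {"type": "scale", "scale": [1.0, 0.5, 0.5]},
    {"type": "translation", "translation": [0.0, 10.0, 10.0]},
]
print(apply_transformations([2, 4, 8], transformations))  # [2.0, 12.0, 14.0]
```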

All implementations present an equivalent representation of a dataset, which can be downloaded or uploaded freely. An interactive
version of this diagram is available from the [OME2020 Workshop](https://downloads.openmicroscopy.org/presentations/2020/Dundee/Workshops/NGFF/zarr_diagram/).
Mouse over the black boxes representing the implementations above to get a quick tip on how to use them.

Note: If you would like to see your project listed, please open an issue or PR on the [ome/ngff](https://github.com/ome/ngff) repository.
## Other resources
```{toctree}
:maxdepth: 1
examples/index
schemas/index
```