Datasets#
This page provides an overview of how datasets are defined, structured, and handled in the processing chain. The goal is to ensure consistent, analysis-ready data products that can be accessed efficiently and extended over time.
What is a Dataset?#
A dataset is the main unit of storage and analysis. A dataset:
- represents a rectangular collection of variables with shared coordinates
- is stored in Zarr format
- corresponds to one deployment of an instrument or product
Datasets are designed so that they can be accessed directly for analysis via our intake catalog.
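Because each dataset is a Zarr store, it can also be opened directly with xarray. As a minimal sketch, assuming a placeholder store URL (the intake catalog shown further below is the recommended entry point):
import xarray as xr

# Minimal sketch with a placeholder URL; in practice the intake
# catalog (shown below) resolves the real store locations.
ds = xr.open_zarr("https://example.org/datasets/BCO.surfacemet_wxt_v1.zarr")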
Analysis-Ready Datasets#
Our goal is to produce analysis-ready datasets. These datasets should be usable for scientific analysis without additional preprocessing.
An analysis-ready dataset should:
- represent the whole observation period
- follow CF conventions where possible
- include appropriate metadata and units
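For illustration, CF-style metadata on a variable might look as follows (the variable name, values, and attributes here are hypothetical):
import numpy as np
import xarray as xr

# Hypothetical variable: standard_name, units, and long_name
# attributes make the data self-describing in the CF sense.
ta = xr.DataArray(
    np.array([298.2, 298.4], dtype="float32"),
    dims="time",
    attrs={
        "standard_name": "air_temperature",
        "units": "K",
        "long_name": "near-surface air temperature",
    },
)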
Dataset Organization#
Datasets follow a hierarchical naming scheme that reflects their observational context.
The hierarchy is:
platform.campaign.instrument
or, if no campaign is relevant:
platform.instrument
The instrument name should include configuration (e.g., _c1) and version (e.g., _v1) suffixes.
Examples:
BCO.surfacemet_wxt_v1
BCO.lidar_CORAL_LR_t_c1_v1
METEOR.EUREC4A.lidar_LICHT_LR_t_v1
Naming rules:
- . separates hierarchical levels
- _ is used within names
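Because these rules are purely mechanical, a dataset identifier can be split programmatically. A minimal sketch (the helper below is illustrative, not part of the processing chain):
# Illustrative helper: split a dataset identifier into its levels.
def parse_dataset_id(dataset_id):
    parts = dataset_id.split(".")
    if len(parts) == 3:
        platform, campaign, instrument = parts
    else:
        platform, instrument = parts
        campaign = None
    return platform, campaign, instrument

parse_dataset_id("METEOR.EUREC4A.lidar_LICHT_LR_t_v1")
# -> ('METEOR', 'EUREC4A', 'lidar_LICHT_LR_t_v1')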
Datasets can be accessed through the intake catalog, for example:
import intake
cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/catalog.yaml")
cat.BCO.surfacemet_wxt_v1.to_dask()
<xarray.Dataset> Size: 6GB
Dimensions: (time: 43806390, bnd: 2)
Coordinates:
alt float64 8B ...
lat float64 8B ...
lon float64 8B ...
* time (time) datetime64[ns] 350MB 2010-12-16T16:24:00 ....
Dimensions without coordinates: bnd
Data variables: (12/25)
DIR (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
DL (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
DR (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
H (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
HDS (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
HI (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
... ...
VR (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
VS (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
air_temperature_status (time) int8 44MB dask.array<chunksize=(262144,), meta=np.ndarray>
sensor_location (time) int8 44MB dask.array<chunksize=(262144,), meta=np.ndarray>
time_bounds (time, bnd) datetime64[ns] 701MB dask.array<chunksize=(262144, 2), meta=np.ndarray>
wind_direction_status (time) int8 44MB dask.array<chunksize=(262144,), meta=np.ndarray>
Attributes:
Conventions: CF-1.12
_logical_cutoff_date: 2026-04-14T00:00:00Z
bcoproc_version: 0.0.0.post1345.dev0+2215fe5
featureType: timeSeries
institution: Max Planck Institute for Meteorology, Hamburg
license: CC0-1.0
location: The Barbados Cloud Observatory (BCO), Deebles Poin...
platform: BCO
source: Vaisala WXT-520
summary: This dataset contains basic meteorological measure...
title: WXT-2 ground station data from BCO (Level 1)
tool_versions: {"Python": "3.11.2 (main, Apr 28 2025, 14:11:48) [...
Coordinate Conventions#
Datasets follow CF conventions for coordinate naming where possible.
Primary coordinates include:
| Coordinate | Meaning |
|---|---|
| time | UTC timestamp |
| alt | altitude above the geoid (meters) |
| range | line-of-sight distance from instrument (meters) |
| lat | latitude (degrees_north) |
| lon | longitude (degrees_east) |
Primary coordinates must:
- be strictly monotonic
- contain no missing values
When sensor and data coordinates differ, sensor coordinates are provided using the prefix sensor_, e.g., sensor_alt.
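These requirements can be verified mechanically. A minimal sketch using the pandas index that xarray exposes for a coordinate (the checks are illustrative, not the project's validation code):
def check_primary_coordinate(ds, name):
    # Primary coordinates must be strictly monotonic and complete.
    idx = ds[name].to_index()
    assert not idx.hasnans, f"{name} contains missing values"
    assert idx.is_monotonic_increasing and idx.is_unique, (
        f"{name} is not strictly monotonic"
    )

# e.g., check_primary_coordinate(cat.BCO.surfacemet_wxt_v1.to_dask(), "time")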
Incrementally Growing Datasets#
Datasets are stored as Zarr archives and are extended continuously as new data becomes available. Rather than rewriting a dataset, new data is appended as additional chunks, which enables efficient cloud-based storage and scalable analysis. The processing chain is orchestrated by an Airflow server.
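A minimal sketch of the append pattern with xarray (the file and store names below are placeholders; the real pipeline is driven by Airflow):
import xarray as xr

# Placeholder names: append freshly processed records along the
# time dimension instead of rewriting the whole archive.
new_chunk = xr.open_dataset("new_records.nc")
new_chunk.to_zarr("dataset.zarr", mode="a", append_dim="time")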