Datasets#
This page provides an overview of how datasets are defined, structured, and handled in the processing chain. The goal is to ensure consistent, analysis-ready data products that can be accessed efficiently and extended over time.
What is a Dataset?#
A dataset is the main unit of storage and analysis. A dataset:
- represents a rectangular collection of variables with shared coordinates
- is stored in Zarr format
- corresponds to one deployment of an instrument or product
Datasets are designed so that they can be accessed directly for analysis via our intake catalog.
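Because each dataset is a Zarr store, it can also be opened directly with xarray. As a minimal sketch, assuming a placeholder store URL (the intake catalog shown further below is the recommended entry point):
import xarray as xr

# Minimal sketch with a placeholder URL; in practice the intake
# catalog (shown below) resolves the real store locations.
ds = xr.open_zarr("https://example.org/datasets/BCO.surfacemet_wxt_v1.zarr")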
Analysis-Ready Datasets#
Our goal is to produce analysis-ready datasets. These datasets should be usable for scientific analysis without additional preprocessing.
An analysis-ready dataset should:
- represent the whole observation period
- follow CF conventions where possible
- include appropriate metadata and units
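For illustration, CF-style metadata on a variable might look as follows (the variable name, values, and attributes here are hypothetical):
import numpy as np
import xarray as xr

# Hypothetical variable: standard_name, units, and long_name
# attributes make the data self-describing in the CF sense.
ta = xr.DataArray(
    np.array([298.2, 298.4], dtype="float32"),
    dims="time",
    attrs={
        "standard_name": "air_temperature",
        "units": "K",
        "long_name": "near-surface air temperature",
    },
)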
Dataset Organization#
Datasets follow a hierarchical naming scheme that reflects their observational context.
The hierarchy is:
platform.campaign.instrument
or, if no campaign is relevant:
platform.instrument
The instrument name should include configuration (e.g., _c1) and version (e.g., _v1) suffixes.
Examples:
BCO.surfacemet_wxt_v1
BCO.lidar_CORAL_LR_t_c1_v1
METEOR.EUREC4A.lidar_LICHT_LR_t_v1
Naming rules:
- . separates hierarchical levels
- _ is used within names
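Because these rules are purely mechanical, a dataset identifier can be split programmatically. A minimal sketch (the helper below is illustrative, not part of the processing chain):
# Illustrative helper: split a dataset identifier into its levels.
def parse_dataset_id(dataset_id):
    parts = dataset_id.split(".")
    if len(parts) == 3:
        platform, campaign, instrument = parts
    else:
        platform, instrument = parts
        campaign = None
    return platform, campaign, instrument

parse_dataset_id("METEOR.EUREC4A.lidar_LICHT_LR_t_v1")
# -> ('METEOR', 'EUREC4A', 'lidar_LICHT_LR_t_v1')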
Datasets can be accessed through the intake catalog, for example:
import intake
cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/catalog.yaml")
cat.BCO.surfacemet_wxt_v1.to_dask()
<xarray.Dataset> Size: 6GB
Dimensions: (time: 43806390, bnd: 2)
Coordinates:
alt float64 8B ...
lat float64 8B ...
lon float64 8B ...
* time (time) datetime64[ns] 350MB 2010-12-16T16:24:00 ....
Dimensions without coordinates: bnd
Data variables: (12/25)
DIR (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
DL (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
DR (time) float64 350MB dask.array<chunksize=(262144,), meta=np.ndarray>
H (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
HDS (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
HI (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
... ...
VR (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
VS (time) float32 175MB dask.array<chunksize=(262144,), meta=np.ndarray>
air_temperature_status (time) int8 44MB dask.array<chunksize=(262144,), meta=np.ndarray>
sensor_location (time) int8 44MB dask.array<chunksize=(262144,), meta=np.ndarray>
time_bounds (time, bnd) datetime64[ns] 701MB dask.array<chunksize=(262144, 2), meta=np.ndarray>
wind_direction_status (time) int8 44MB dask.array<chunksize=(262144,), meta=np.ndarray>
Attributes:
Conventions: CF-1.12
_logical_cutoff_date: 2026-04-14T00:00:00Z
bcoproc_version: 0.0.0.post1345.dev0+2215fe5
featureType: timeSeries
institution: Max Planck Institute for Meteorology, Hamburg
license: CC0-1.0
location: The Barbados Cloud Observatory (BCO), Deebles Poin...
platform: BCO
source: Vaisala WXT-520
summary: This dataset contains basic meteorological measure...
title: WXT-2 ground station data from BCO (Level 1)
tool_versions: {"Python": "3.11.2 (main, Apr 28 2025, 14:11:48) [...
Coordinate Conventions#
Datasets follow CF conventions for coordinate naming where possible.
Primary coordinates include:
| Coordinate | Meaning |
|---|---|
| time | UTC timestamp |
| alt | altitude above the geoid (meters) |
| range | line-of-sight distance from instrument (meters) |
| lat | latitude (degrees_north) |
| lon | longitude (degrees_east) |
Primary coordinates must:
- be strictly monotonic
- contain no missing values
When sensor and data coordinates differ, sensor coordinates are provided using the prefix sensor_, e.g., sensor_alt.
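These requirements can be verified mechanically. A minimal sketch using the pandas index that xarray exposes for a coordinate (the checks are illustrative, not the project's validation code):
def check_primary_coordinate(ds, name):
    # Primary coordinates must be strictly monotonic and complete.
    idx = ds[name].to_index()
    assert not idx.hasnans, f"{name} contains missing values"
    assert idx.is_monotonic_increasing and idx.is_unique, (
        f"{name} is not strictly monotonic"
    )

# e.g., check_primary_coordinate(cat.BCO.surfacemet_wxt_v1.to_dask(), "time")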
Incrementally Growing Datasets#
Datasets are stored as Zarr archives and are extended continuously as new data becomes available. Rather than rewriting a dataset, new data is appended as additional chunks, which enables efficient cloud-based storage and scalable analysis. The processing chain is orchestrated by an Airflow server.
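A minimal sketch of the append pattern with xarray (the file and store names below are placeholders; the real pipeline is driven by Airflow):
import xarray as xr

# Placeholder names: append freshly processed records along the
# time dimension instead of rewriting the whole archive.
new_chunk = xr.open_dataset("new_records.nc")
new_chunk.to_zarr("dataset.zarr", mode="a", append_dim="time")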