---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Datasets

This page provides an overview of how **datasets are defined, structured, and handled** in the processing chain.
The goal is to ensure consistent, analysis-ready data products that can be accessed efficiently and extended over time.

## What is a Dataset?

A **dataset** is the main unit of storage and analysis. A dataset:
* represents a **rectangular collection of variables with shared coordinates**
* is stored in [**Zarr format**](https://zarr.dev)
* corresponds to **one deployment** of an instrument or product

Datasets are designed so that they can be accessed directly for analysis via our intake catalog.

## Analysis-Ready Datasets

Our goal is to produce **analysis-ready datasets**.
These datasets should be usable for scientific analysis without additional preprocessing.

An analysis-ready dataset should:
* represent the **whole observation period**
* follow [**CF conventions**](https://cfconventions.org/) where possible
* include appropriate **metadata and units**

## Dataset Organization

Datasets follow a hierarchical naming scheme that reflects their observational context.

The hierarchy is:

    platform.campaign.instrument

or, if no campaign is relevant:

    platform.instrument

The instrument name should include configuration (`_c1`) and version (`_v1`) information.

Examples:

    BCO.surfacemet_wxt_v1
    BCO.lidar_CORAL_LR_t_c1_v1
    METEOR.EUREC4A.lidar_LICHT_LR_t_v1

Naming rules:

* `.` separates **hierarchical levels**
* `_` is used **within names**
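
The naming rules above can be sketched as a small parser. This is an illustrative helper, not part of the processing chain: `DatasetName` and `parse_dataset_name` are hypothetical names, and the sketch assumes that the configuration and version suffixes are optional and always appear at the end of the instrument name in the order `_c<N>_v<N>`.

```python
import re
from typing import NamedTuple, Optional


class DatasetName(NamedTuple):
    platform: str
    campaign: Optional[str]
    instrument: str  # instrument name without the _c/_v suffixes
    config: Optional[int]
    version: Optional[int]


# Assumption: an optional "_c<N>" followed by an optional "_v<N>" closes the name.
_INSTRUMENT = re.compile(
    r"^(?P<base>.+?)(?:_c(?P<config>\d+))?(?:_v(?P<version>\d+))?$"
)


def _to_int(text: Optional[str]) -> Optional[int]:
    return int(text) if text is not None else None


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its hierarchical levels and suffixes."""
    levels = name.split(".")  # "." separates hierarchical levels
    if len(levels) not in (2, 3) or not all(levels):
        raise ValueError(f"expected 2 or 3 non-empty levels, got {name!r}")
    platform = levels[0]
    campaign = levels[1] if len(levels) == 3 else None
    match = _INSTRUMENT.match(levels[-1])
    return DatasetName(
        platform=platform,
        campaign=campaign,
        instrument=match.group("base"),
        config=_to_int(match.group("config")),
        version=_to_int(match.group("version")),
    )
```

For instance, `parse_dataset_name("METEOR.EUREC4A.lidar_LICHT_LR_t_v1")` yields the platform `METEOR`, the campaign `EUREC4A`, and the instrument `lidar_LICHT_LR_t` with version 1 and no configuration.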

Datasets can be accessed through the intake catalog, for example:

```{code-cell} ipython3
import intake

# Open the catalog and load one dataset lazily.
cat = intake.open_catalog("https://tcodata.mpimet.mpg.de/catalog.yaml")
cat.BCO.surfacemet_wxt_v1.to_dask()
```

## Coordinate Conventions

Datasets follow **CF conventions** for coordinate naming where possible.

Primary coordinates include:

Coordinate | Meaning
--- | ---
`time` | UTC timestamp
`alt` | altitude above the geoid (meters)
`range` | line-of-sight distance from instrument (meters)
`lat` | latitude (`degrees_north`)
`lon` | longitude (`degrees_east`)

Primary coordinates must:
* be **strictly monotonic**
* contain **no missing values**
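
The two requirements on primary coordinates can be checked with a few lines of NumPy. The following is a minimal sketch rather than the validation used in the processing chain; `check_primary_coordinate` is a hypothetical helper name, and missing values are assumed to be encoded as `NaN` (floats) or `NaT` (timestamps).

```python
import numpy as np


def check_primary_coordinate(values):
    """Return a list of problems found in a 1-D coordinate (empty if none)."""
    values = np.asarray(values)
    if values.ndim != 1:
        raise ValueError("primary coordinates are expected to be 1-D")

    # Missing values: NaT for datetime/timedelta, NaN for floats.
    if values.dtype.kind in "mM":
        missing = np.isnat(values)
    elif values.dtype.kind == "f":
        missing = np.isnan(values)
    else:
        missing = np.zeros(values.shape, dtype=bool)

    if missing.any():
        # Monotonicity is not meaningful while values are missing.
        return ["contains missing values"]

    # Strictly monotonic: every step goes up, or every step goes down.
    increasing = np.all(values[1:] > values[:-1])
    decreasing = np.all(values[1:] < values[:-1])
    if not (increasing or decreasing):
        return ["not strictly monotonic"]
    return []
```

With a dataset opened through xarray, such a check could be applied to, for example, `ds["time"].values`.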

When sensor and data coordinates differ, **sensor coordinates** are provided using the prefix `sensor_`, e.g., `sensor_alt`.

## Incrementally Growing Datasets

Datasets are stored as **Zarr archives** and are **extended continuously** as new data becomes available.
Rather than rewriting datasets, new data is **appended as additional chunks**.
This enables efficient cloud-based storage and scalable analysis.
The processing chain is orchestrated by an Airflow server.
