
Commit 1510de7

JackKelly and aldenks authored
ICON-EU step 1: Implement template_config.py (#184)
* initialize-new-integration dwd icon-eu forecast. And register.
* Gemini CLI's first draft. And I've started updating.
* All that's left to do in template_config is check the data_vars
* First complete draft of ICON-EU template_config.py
* Add fields for all DataVarAttrs from ICON PDF doc.
* Reformat the comment strings so they are all under 100 chars long.
* Add a note about alb_rad not being included
* Implemented first test for template_config. Add latest.zarr
* Use the correct CRS in template_config and its test
* Update the DataVarAttrs with short_name and standard_name from ICON metadata.
* Update latest.zarr metadata with short_name and standard_name from GRIB metadata
* Removing dynamical_dataset.py and region_job.py from git repo. These files are currently just the output from the initialize-new-integration tool. They will be added in a subsequent PR once I've modified them.
* Add expected_forecast_length.
* Update append_dim_start datetime, as per Alden's review
* Add dwd/__init__.py and dwd/icon_eu/__init__.py as per Alden's review, in the hopes of fixing the mypy failure. mypy passes locally.
* Remove noqa: F401, as per Alden's review
* Replace hyphen with underscore in ICON EU dataset_id, as per Alden's review. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Add a comment confirming that the coords are for pixel centers
* Comment that the CRS is a perfect sphere extracted from GRIB, as per Alden's PR review. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* lead_time is 93 steps NOT 120 steps! Bug found by Alden in PR review.
* Change wind_v_10 to wind_v_10m. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Change relative_humidity to relative_humidity_2m. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Change pressure_reduced_to_msl to pressure_reduced_to_mean_sea_level. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Use chunks with 165 pixels in the latitude dim, as per Alden's review
* Add comment about chunk sizes in MB. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Add comment about size of shards in MB. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Remove the text 'mean over forecast time', as per Alden's review
* Replace 'total_precipitation' with 'precipitation_surface'. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Add unmodified files created by initialize-new-integration
* Update pytest to 8.4.1. Doesn't fix the issue yet.
* Adding missing __init__.py files to test directories
* Change latitude chunks to 219 so pydantic tests pass. And update comment about chunk sizes
* Change `wind_u_10` to `wind_u_10m`. Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
* Update Zarr metadata. Tests pass.
* Re-run initialize-new-integration after merging with main branch

Co-authored-by: Alden Keefe Sampson <aldenkeefesampson@gmail.com>
1 parent 1becc8f commit 1510de7
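Several of the review items above concern chunk and shard sizing: latitude chunks of 165 pixels, later changed to 219, comments recording chunk and shard sizes in MB, and a lead_time of 93 steps. Below is a back-of-the-envelope sketch of that arithmetic, assuming ICON-EU's regular 0.0625° lat/lon grid of roughly 657 × 1377 points and a hypothetical chunk shape; only the 93 steps and the 219-pixel latitude chunk come from this commit, the rest are assumptions.

# Hypothetical chunk-size arithmetic; the grid shape and non-latitude chunk extents are assumptions.
n_lead_times = 93      # lead_time steps, per the commit message
lat_chunk = 219        # latitude chunk, per the commit message
lon_points = 1377      # assumed ICON-EU regular-grid longitude size
bytes_per_value = 4    # float32

# Assuming one init_time and the full longitude extent per chunk:
chunk_bytes = 1 * n_lead_times * lat_chunk * lon_points * bytes_per_value
print(f"~{chunk_bytes / 1e6:.0f} MB per chunk, uncompressed")  # ~112 MB

If the latitude dimension really is 657 points, then 219 divides it into exactly three chunks, which may be why the chunk size moved from 165 (which does not divide 657 evenly) to 219.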

50 files changed · +5182 −5 lines changed

Some content is hidden: large commits have some content hidden by default.


pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ dev-dependencies = [
     "pandas-stubs>=2.2.2.240909",
     "pre-commit>=3.8.0",
     "pyqt6>=6.7.1",
-    "pytest>=8.3.4",
+    "pytest>=8.4.1",
     "ruff==0.12.1",
     "types-requests>=2.32.0.20240914",
 ]

src/reformatters/__main__.py

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,7 @@
     NoaaNdviCdrAnalysisDataset,
 )
 from reformatters.contrib.uarizona.swann.analysis import UarizonaSwannAnalysisDataset
+from reformatters.dwd.icon_eu.forecast import DwdIconEuForecastDataset
 from reformatters.example.new_dataset import initialize_new_integration
 from reformatters.noaa.gfs.forecast import NoaaGfsForecastDataset
 from reformatters.noaa.hrrr.forecast_48_hour.dynamical_dataset import (
@@ -52,6 +53,7 @@ class UpstreamGriddedZarrsDatasetStorageConfig(StorageConfig):
         storage_config=UpstreamGriddedZarrsDatasetStorageConfig()
     ),
     NoaaGfsForecastDataset(storage_config=SourceCoopDatasetStorageConfig()),
+    DwdIconEuForecastDataset(storage_config=SourceCoopDatasetStorageConfig()),
     NoaaHrrrForecast48HourDataset(storage_config=SourceCoopDatasetStorageConfig()),
 ]
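With the two added lines above, the ICON-EU dataset is both importable and registered alongside the other datasets in __main__.py. An illustrative sketch of how a registered dataset might be looked up by id from such a list; the function and the id string in the trailing comment are assumptions for illustration, not code from this repository:

# Illustrative only: `get_dataset` is an assumption, not part of this diff.
from reformatters.common.dynamical_dataset import DynamicalDataset

def get_dataset(datasets: list[DynamicalDataset], dataset_id: str) -> DynamicalDataset:
    """Pick a registered dataset by its dataset_id (the scaffold references self.dataset_id, so each dataset exposes one)."""
    matching = [d for d in datasets if d.dataset_id == dataset_id]
    if len(matching) != 1:
        raise ValueError(f"Expected exactly one dataset with id {dataset_id!r}, found {len(matching)}")
    return matching[0]

# e.g. get_dataset(datasets, "<icon-eu dataset_id>")  # the exact id string is not shown in this diff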

src/reformatters/dwd/__init__.py

Whitespace-only changes.

src/reformatters/dwd/icon_eu/__init__.py

Whitespace-only changes.
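Per the commit message, these two empty __init__.py files were added in the hope of fixing a mypy failure. A brief sketch of why that can help; the behavior described is general mypy package resolution, not output from this repository's CI:

# src/reformatters/dwd/__init__.py          <- marks "dwd" as a regular package
# src/reformatters/dwd/icon_eu/__init__.py  <- marks "icon_eu" as a regular package
#
# Without them, mypy may treat these directories as namespace packages (or fail to
# find them at all, depending on its namespace-package settings), so an import like
#     from reformatters.dwd.icon_eu.forecast import DwdIconEuForecastDataset
# can fail to resolve during type checking even though it works at runtime.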
src/reformatters/dwd/icon_eu/forecast/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
from .dynamical_dataset import DwdIconEuForecastDataset as DwdIconEuForecastDataset
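The `import X as X` spelling is the explicit re-export form: under mypy's no-implicit-reexport behavior and ruff's F401 rule, a plain re-export in an __init__.py can be flagged as an unused import, which is presumably why the earlier `# noqa: F401` could be dropped (the commit message notes its removal). A small sketch of the pattern; only the first import line below appears in this diff:

# Explicit re-export (what this __init__.py does); type checkers treat the name as public:
from .dynamical_dataset import DwdIconEuForecastDataset as DwdIconEuForecastDataset

# Downstream code then imports from the package rather than the submodule,
# exactly as the __main__.py change above does:
#     from reformatters.dwd.icon_eu.forecast import DwdIconEuForecastDataset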
src/reformatters/dwd/icon_eu/forecast/dynamical_dataset.py

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
from collections.abc import Sequence

from reformatters.common import validation
from reformatters.common.dynamical_dataset import DynamicalDataset
from reformatters.common.kubernetes import CronJob

from .region_job import DwdIconEuForecastRegionJob, DwdIconEuForecastSourceFileCoord
from .template_config import DwdIconEuDataVar, DwdIconEuForecastTemplateConfig


class DwdIconEuForecastDataset(
    DynamicalDataset[DwdIconEuDataVar, DwdIconEuForecastSourceFileCoord]
):
    template_config: DwdIconEuForecastTemplateConfig = DwdIconEuForecastTemplateConfig()
    region_job_class: type[DwdIconEuForecastRegionJob] = DwdIconEuForecastRegionJob

    def operational_kubernetes_resources(self, image_tag: str) -> Sequence[CronJob]:
        """Return the kubernetes cron job definitions to operationally update and validate this dataset."""
        # operational_update_cron_job = ReformatCronJob(
        #     name=f"{self.dataset_id}-operational-update",
        #     schedule=_OPERATIONAL_CRON_SCHEDULE,
        #     pod_active_deadline=timedelta(minutes=30),
        #     image=image_tag,
        #     dataset_id=self.dataset_id,
        #     cpu="14",
        #     memory="30G",
        #     shared_memory="12G",
        #     ephemeral_storage="30G",
        #     secret_names=self.storage_config.k8s_secret_names,
        # )
        # validation_cron_job = ValidationCronJob(
        #     name=f"{self.dataset_id}-validation",
        #     schedule=_VALIDATION_CRON_SCHEDULE,
        #     pod_active_deadline=timedelta(minutes=10),
        #     image=image_tag,
        #     dataset_id=self.dataset_id,
        #     cpu="1.3",
        #     memory="7G",
        #     secret_names=self.storage_config.k8s_secret_names,
        # )

        # return [operational_update_cron_job, validation_cron_job]
        raise NotImplementedError(
            f"Implement `operational_kubernetes_resources` on {self.__class__.__name__}"
        )

    def validators(self) -> Sequence[validation.DataValidator]:
        """Return a sequence of DataValidators to run on this dataset."""
        # return (
        #     validation.check_analysis_current_data,
        #     validation.check_analysis_recent_nans,
        # )
        raise NotImplementedError(
            f"Implement `validators` on {self.__class__.__name__}"
        )
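Both methods above are intentionally left raising NotImplementedError, with the intended shape shown in the commented-out scaffold. A minimal sketch of how `validators` might eventually be filled in, reusing the scaffold's suggested checks; whether those analysis-oriented checks are appropriate for a forecast dataset is not something this commit decides:

    # Sketch only: would live inside DwdIconEuForecastDataset; the two checks are the
    # ones suggested by the scaffold's comments, not choices made in this commit.
    def validators(self) -> Sequence[validation.DataValidator]:
        """Return a sequence of DataValidators to run on this dataset."""
        return (
            validation.check_analysis_current_data,
            validation.check_analysis_recent_nans,
        )

`operational_kubernetes_resources` would be completed the same way, by promoting the commented ReformatCronJob/ValidationCronJob block to real code once cron schedules, resource requests, and secret names are settled.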
src/reformatters/dwd/icon_eu/forecast/region_job.py

Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
from collections.abc import Callable, Mapping, Sequence
from pathlib import Path

import xarray as xr

from reformatters.common.logging import get_logger
from reformatters.common.region_job import (
    CoordinateValueOrRange,
    RegionJob,
    SourceFileCoord,
)
from reformatters.common.storage import StoreFactory
from reformatters.common.types import (
    AppendDim,
    ArrayFloat32,
    DatetimeLike,
    Dim,
)

from .template_config import DwdIconEuDataVar

log = get_logger(__name__)


class DwdIconEuForecastSourceFileCoord(SourceFileCoord):
    """Coordinates of a single source file to process."""

    def get_url(self) -> str:
        raise NotImplementedError("Return the URL of the source file.")

    def out_loc(
        self,
    ) -> Mapping[Dim, CoordinateValueOrRange]:
        """
        Returns a data array indexer which identifies the region in the output dataset
        to write the data from the source file. The indexer is a dict from dimension
        names to coordinate values or slices.
        """
        # If the names of the coordinate attributes of your SourceFileCoord subclass are also all
        # dimension names in the output dataset (e.g. init_time and lead_time),
        # delete this implementation and use the default implementation of this method.
        #
        # Examples where you would override this method:
        # - An analysis dataset created from forecast data:
        #   return {"time": self.init_time + self.lead_time}
        return super().out_loc()


class DwdIconEuForecastRegionJob(
    RegionJob[DwdIconEuDataVar, DwdIconEuForecastSourceFileCoord]
):
    # Optionally, limit the number of variables downloaded together.
    # If set to a value less than len(data_vars), downloading, reading/recompressing,
    # and uploading steps will be pipelined within a region job.
    # 5 is a reasonable default if it is possible to download less than all
    # variables in a single file (e.g. you have a grib index).
    # Leave unset if you have to download a whole file to get one variable out
    # to avoid re-downloading the same file multiple times.
    #
    # max_vars_per_download_group: ClassVar[int | None] = None

    # Implement this method only if different variables must be retrieved from different urls
    #
    # @classmethod
    # def source_groups(
    #     cls,
    #     data_vars: Sequence[DwdIconEuDataVar],
    # ) -> Sequence[Sequence[DwdIconEuDataVar]]:
    #     """
    #     Return groups of variables, where all variables in a group can be retrieved from the same source file.
    #     """
    #     grouped = defaultdict(list)
    #     for data_var in data_vars:
    #         grouped[data_var.internal_attrs.file_type].append(data_var)
    #     return list(grouped.values())

    # Implement this method only if specific post processing in this dataset
    # requires data from outside the region defined by self.region,
    # e.g. for deaccumulation or interpolation along append_dim in an analysis dataset.
    #
    # def get_processing_region(self) -> slice:
    #     """
    #     Return a slice of integer offsets into self.template_ds along self.append_dim that identifies
    #     the region to process. In most cases this is exactly self.region, but if additional data outside
    #     the region is required, for example for correct interpolation or deaccumulation, this method can
    #     return a modified slice (e.g. `slice(self.region.start - 1, self.region.stop + 1)`).
    #     """
    #     return self.region

    def generate_source_file_coords(
        self,
        processing_region_ds: xr.Dataset,
        data_var_group: Sequence[DwdIconEuDataVar],
    ) -> Sequence[DwdIconEuForecastSourceFileCoord]:
        """Return a sequence of coords, one for each source file required to process the data covered by processing_region_ds."""
        # return [
        #     DwdIconEuForecastSourceFileCoord(
        #         init_time=init_time,
        #         lead_time=lead_time,
        #     )
        #     for init_time, lead_time in itertools.product(
        #         processing_region_ds["init_time"].values,
        #         processing_region_ds["lead_time"].values,
        #     )
        # ]
        raise NotImplementedError(
            "Return a sequence of SourceFileCoord objects, one for each source file required to process the data covered by processing_region_ds."
        )

    def download_file(self, coord: DwdIconEuForecastSourceFileCoord) -> Path:
        """Download the file for the given coordinate and return the local path."""
        # return http_download_to_disk(coord.get_url(), self.dataset_id)
        raise NotImplementedError(
            "Download the file for the given coordinate and return the local path."
        )

    def read_data(
        self,
        coord: DwdIconEuForecastSourceFileCoord,
        data_var: DwdIconEuDataVar,
    ) -> ArrayFloat32:
        """Read and return an array of data for the given variable and source file coordinate."""
        # with rasterio.open(coord.downloaded_file_path) as reader:
        #     # TODO: make a band index based on tag matching utility function
        #     matching_indexes = [
        #         i
        #         for i in range(reader.count)
        #         if (tags := reader.tags(i))["GRIB_ELEMENT"]
        #         == data_var.internal_attrs.grib_element
        #         and tags["GRIB_COMMENT"] == data_var.internal_attrs.grib_comment
        #     ]
        #     assert len(matching_indexes) == 1, f"Expected exactly 1 matching band, found {matching_indexes}. {data_var.internal_attrs.grib_element=}, {data_var.internal_attrs.grib_description=}, {coord.downloaded_file_path=}"
        #     rasterio_band_index = 1 + matching_indexes[0]  # rasterio is 1-indexed
        #     return reader.read(rasterio_band_index, dtype=np.float32)
        raise NotImplementedError(
            "Read and return data for the given variable and source file coordinate."
        )

    # Implement this to apply transformations to the array (e.g. deaccumulation)
    #
    # def apply_data_transformations(
    #     self, data_array: xr.DataArray, data_var: DwdIconEuDataVar
    # ) -> None:
    #     """
    #     Apply in-place data transformations to the output data array for a given data variable.

    #     This method is called after reading all data for a variable into the shared-memory array,
    #     and before writing shards to the output store. The default implementation applies binary
    #     rounding to float32 arrays if `data_var.internal_attrs.keep_mantissa_bits` is set.

    #     Subclasses may override this method to implement additional transformations such as
    #     deaccumulation, interpolation or other custom logic. All transformations should be
    #     performed in-place (don't copy `data_array`, it's large).

    #     Parameters
    #     ----------
    #     data_array : xr.DataArray
    #         The output data array to be transformed in-place.
    #     data_var : DwdIconEuDataVar
    #         The data variable metadata object, which may contain transformation parameters.
    #     """
    #     super().apply_data_transformations(data_array, data_var)

    def update_template_with_results(
        self, process_results: Mapping[str, Sequence[DwdIconEuForecastSourceFileCoord]]
    ) -> xr.Dataset:
        """
        Update template dataset based on processing results. This method is called
        during operational updates.

        Subclasses should implement this method to apply dataset-specific adjustments
        based on the processing results. Examples include:
        - Trimming dataset along append_dim to only include successfully processed data
        - Loading existing coordinate values from the primary store and updating them based on results
        - Updating metadata based on what was actually processed vs what was planned

        The default implementation trims along append_dim to end at the most recent
        successfully processed coordinate (timestamp).

        Parameters
        ----------
        process_results : Mapping[str, Sequence[DwdIconEuForecastSourceFileCoord]]
            Mapping from variable names to their source file coordinates with final processing status.

        Returns
        -------
        xr.Dataset
            Updated template dataset reflecting the actual processing results.
        """
        # The super() implementation looks like this:
        #
        # max_append_dim_processed = max(
        #     (
        #         c.out_loc()[self.append_dim]  # type: ignore[type-var]
        #         for c in chain.from_iterable(process_results.values())
        #         if c.status == SourceFileStatus.Succeeded
        #     ),
        #     default=None,
        # )
        # if max_append_dim_processed is None:
        #     # No data was processed, trim the template to stop before this job's region
        #     # This is using isel's exclusive slice end behavior
        #     return self.template_ds.isel(
        #         {self.append_dim: slice(None, self.region.start)}
        #     )
        # else:
        #     return self.template_ds.sel(
        #         {self.append_dim: slice(None, max_append_dim_processed)}
        #     )
        #
        # If you like the above behavior, skip implementing this method.
        # If you need to customize the behavior, implement this method.

        raise NotImplementedError(
            "Subclasses implement update_template_with_results() with dataset-specific logic"
        )

    @classmethod
    def operational_update_jobs(
        cls,
        primary_store_factory: StoreFactory,
        tmp_store: Path,
        get_template_fn: Callable[[DatetimeLike], xr.Dataset],
        append_dim: AppendDim,
        all_data_vars: Sequence[DwdIconEuDataVar],
        reformat_job_name: str,
    ) -> tuple[
        Sequence["RegionJob[DwdIconEuDataVar, DwdIconEuForecastSourceFileCoord]"],
        xr.Dataset,
    ]:
        """
        Return the sequence of RegionJob instances necessary to update the dataset
        from its current state to include the latest available data.

        Also return the template_ds, expanded along append_dim through the end of
        the data to process. The dataset returned here may extend beyond the
        available data at the source, in which case `update_template_with_results`
        will trim the dataset to the actual data processed.

        The exact logic is dataset-specific, but it generally follows this pattern:
        1. Figure out the range of time to process: append_dim_start (inclusive) and append_dim_end (exclusive)
           a. Read existing data from the primary store to determine what's already processed
           b. Optionally identify recent incomplete/non-final data for reprocessing
        2. Call get_template_fn(append_dim_end) to get the template_ds
        3. Create RegionJob instances by calling cls.get_jobs(..., filter_start=append_dim_start)

        Parameters
        ----------
        primary_store_factory : StoreFactory
            The factory to get the primary store to read existing data from and write updates to.
        tmp_store : Path
            The temporary Zarr store to write into while processing.
        get_template_fn : Callable[[DatetimeLike], xr.Dataset]
            Function to get the template_ds for the operational update.
        append_dim : AppendDim
            The dimension along which data is appended (e.g., "time").
        all_data_vars : Sequence[DwdIconEuDataVar]
            Sequence of all data variable configs for this dataset.
        reformat_job_name : str
            The name of the reformatting job, used for progress tracking.
            This is often the name of the Kubernetes job, or "local".

        Returns
        -------
        Sequence[RegionJob[DwdIconEuDataVar, DwdIconEuForecastSourceFileCoord]]
            RegionJob instances that need processing for operational updates.
        xr.Dataset
            The template_ds for the operational update.
        """
        # existing_ds = xr.open_zarr(primary_store_factory.store())
        # append_dim_start = existing_ds[append_dim].max()
        # append_dim_end = pd.Timestamp.now()
        # template_ds = get_template_fn(append_dim_end)

        # jobs = cls.get_jobs(
        #     kind="operational-update",
        #     primary_store_factory=primary_store_factory,
        #     tmp_store=tmp_store,
        #     template_ds=template_ds,
        #     append_dim=append_dim,
        #     all_data_vars=all_data_vars,
        #     reformat_job_name=reformat_job_name,
        #     filter_start=append_dim_start,
        # )
        # return jobs, template_ds

        raise NotImplementedError(
            "Subclasses implement operational_update_jobs() with dataset-specific logic"
        )
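The skeleton above leaves `get_url`, `generate_source_file_coords`, `download_file`, and `read_data` unimplemented. A hypothetical sketch of the first two for ICON-EU, following the scaffold's commented itertools.product pattern; the coordinate fields, the `variable_name` internal attribute, and the DWD open-data URL layout are assumptions for illustration, not taken from this commit:

import itertools

import pandas as pd


class DwdIconEuForecastSourceFileCoord(SourceFileCoord):
    """One GRIB file per (init_time, lead_time, variable) -- hypothetical field layout."""

    init_time: pd.Timestamp
    lead_time: pd.Timedelta
    variable_name: str  # e.g. "t_2m"; assumed attribute name

    def get_url(self) -> str:
        run = f"{self.init_time:%Y%m%d%H}"
        step_hours = int(self.lead_time / pd.Timedelta(hours=1))
        # Assumed DWD open-data layout for regular-lat-lon, single-level fields.
        return (
            "https://opendata.dwd.de/weather/nwp/icon-eu/grib/"
            f"{self.init_time:%H}/{self.variable_name}/"
            f"icon-eu_europe_regular-lat-lon_single-level_"
            f"{run}_{step_hours:03d}_{self.variable_name.upper()}.grib2.bz2"
        )


def generate_source_file_coords(self, processing_region_ds, data_var_group):
    # One coord per (init_time, lead_time) and per variable in the group, following
    # the scaffold's commented-out itertools.product example.
    return [
        DwdIconEuForecastSourceFileCoord(
            init_time=pd.Timestamp(init_time),
            lead_time=pd.Timedelta(lead_time),
            variable_name=data_var.internal_attrs.variable_name,  # assumed attribute
        )
        for init_time, lead_time in itertools.product(
            processing_region_ds["init_time"].values,
            processing_region_ds["lead_time"].values,
        )
        for data_var in data_var_group
    ]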
