Grib2 scan_grib: Dimensions are Out of Order
I’ve been attempting to follow this kerchunk grib2 example using HRRR data, but I have run into an issue that prevents me from using the Dataset as intended: the X and Y dimensions of the arrays are misnamed (or misordered).
Running the example_combine from the documentation (copied below) and printing the dataset shows the issue: the arrays are labeled as (time, x, y), but the actual ordering of the grib2 data is (time, y, x). 1799 is the X extent of the domain and 1059 is the Y extent of the domain, yet in the dimensions, they are labeled opposite. Attempting to perform operations on the Dataset when working with X and Y coordinate names doesn’t work because of this behavior.
I’ve attempted manually modifying the kerchunk/grib2.py file to re-order occurrences of ["x", "y"], but sadly the solution doesn’t appear to be that simple. Any ideas on how to make sure that the array dimensions are labeled correctly?
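Until the labels are fixed at the source, one possible stopgap is to swap the two dimension names after opening the dataset. The sketch below is only an illustration on a tiny synthetic dataset: the helper name swap_xy_labels and the 3 by 5 grid are made up for the example, and it assumes the underlying data really is laid out as (time, y, x), so that only the labels are wrong. It uses xarray's Dataset.rename and goes through a temporary name so the two labels never clash.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the HRRR dataset (hypothetical sizes): the array is
# stored as (time, y, x) with ny=3 rows and nx=5 columns, but it is labelled
# (time, x, y), mirroring the problem described above.
ny, nx = 3, 5
ds = xr.Dataset({"2t": (("time", "x", "y"), np.zeros((1, ny, nx)))})


def swap_xy_labels(dataset):
    """Relabel 'x' as 'y' and 'y' as 'x' without touching the data."""
    # Go through a temporary name so the two labels never collide.
    return (
        dataset.rename({"x": "_tmp"})
        .rename({"y": "x"})
        .rename({"_tmp": "y"})
    )


fixed = swap_xy_labels(ds)
print(fixed["2t"].dims)
print(dict(fixed.sizes))
```

This only relabels the axes; it does not transpose or copy any data, so it is appropriate only when the array contents are already in (y, x) order.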
Script Output
<xarray.Dataset>
Dimensions:            (time: 1, x: 1059, y: 1799, heightAboveGround: 1)
Coordinates:
  * heightAboveGround  (heightAboveGround) int64 2
  * time               (time) datetime64[ns] 2019-01-01T22:00:00
Dimensions without coordinates: x, y
Data variables:
    2d                 (time, x, y) float64 dask.array<chunksize=(1, 1059, 1799), meta=np.ndarray>
    2r                 (time, x, y) float64 dask.array<chunksize=(1, 1059, 1799), meta=np.ndarray>
    2sh                (time, x, y) float64 dask.array<chunksize=(1, 1059, 1799), meta=np.ndarray>
    2t                 (time, x, y) float64 dask.array<chunksize=(1, 1059, 1799), meta=np.ndarray>
    latitude           (x, y) float64 dask.array<chunksize=(1059, 1799), meta=np.ndarray>
    longitude          (x, y) float64 dask.array<chunksize=(1059, 1799), meta=np.ndarray>
    pt                 (time, x, y) float64 dask.array<chunksize=(1, 1059, 1799), meta=np.ndarray>
Attributes:
    centre:             kwbc
    centreDescription:  US National Weather Service - NCEP
    edition:            2
    subCentre:          0
Example Script
import xarray as xr
import fsspec
from kerchunk.grib2 import scan_grib


def example_combine(
    filter={"typeOfLevel": "heightAboveGround", "level": 2}
):  # pragma: no cover
    """Create combined dataset of weather measurements at 2m height

    Ten consecutive timepoints from ten 120MB files on s3.
    Example usage:

    >>> tot = example_combine()
    >>> ds = xr.open_dataset("reference://", engine="zarr", backend_kwargs={
    ...        "consolidated": False,
    ...        "storage_options": {"fo": tot, "remote_options": {"anon": True}}})
    """
    from kerchunk.combine import MultiZarrToZarr, drop

    files = [
        "s3://noaa-hrrr-bdp-pds/hrrr.20190101/conus/hrrr.t22z.wrfsfcf01.grib2",
    ]
    so = {"anon": True, "default_cache_type": "readahead"}
    out = [scan_grib(u, storage_options=so, filter=filter) for u in files]
    out = sum(out, [])
    mzz = MultiZarrToZarr(
        out,
        remote_protocol="s3",
        preprocess=drop(("valid_time", "step")),
        remote_options=so,
        concat_dims=["time", "var"],
        identical_dims=["heightAboveGround", "latitude", "longitude"],
    )
    return mzz.translate()


def main():
    d = example_combine()
    fs = fsspec.filesystem("reference", fo=d, remote_protocol='s3', remote_options={'anon':True})
    m = fs.get_mapper("")
    ds = xr.open_dataset(m, engine="zarr", backend_kwargs=dict(consolidated=False),
                         chunks={'valid_time':1})
    print(ds)


if __name__ == "__main__":
    main()
Issue Analytics
- Created 9 months ago
- Comments:14 (8 by maintainers)
Top GitHub Comments
I’ll chime in and agree that most (if not all) geospatial grib data I have worked with are in C-style, row-major ordered layout (y, x), (lat, lon), (time, lat, lon), (time, level, lat, lon), etc… I am sure there are specific edge cases, but at least for NOAA/NWS data, this is the norm.
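To make the row-major point concrete, here is a small sketch in plain Python. It is not taken from kerchunk or from any GRIB library; the function name flat_index is made up for illustration, and the grid sizes are the ones quoted in the issue. In a row-major (y, x) layout, stepping along x moves one element in the flat buffer, while stepping along y skips a whole row of nx elements.

```python
# Grid sizes quoted in the issue: 1799 points along x, 1059 along y.
nx, ny = 1799, 1059


def flat_index(j, i, row_length=nx):
    """Offset of grid point (row j, column i) in a row-major buffer."""
    return j * row_length + i


# One step along x is one element; one step along y is a full row.
step_x = flat_index(0, 1) - flat_index(0, 0)
step_y = flat_index(1, 0) - flat_index(0, 0)

# The matching array shape therefore lists y first, then x.
shape = (ny, nx)
print(step_x, step_y, shape)
```

Under this layout the 2D fields have shape (1059, 1799), which matches the chunk sizes in the printed dataset and supports reading the dimensions as (y, x) rather than (x, y).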
That looks right to me!!