
scan_grib is not outputting all the pieces' dimensions correctly.

See original GitHub issue

I am very new at this, so I might not fully understand the exact goal of scan_grib, but here we go (this is a follow-up from Discourse):

What’s the issue: scan_grib doesn’t output all the “parts” that I expect to see in the JSON references, so when I read the file through xarray I am missing parts of it. Something goes wrong when the pieces that come out of _split_file are parsed and written out. I explored a bit, but couldn’t find what the issue is.

grib_file to download: !wget https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2

versions I am using: [screenshot of installed package versions]

import xarray as xr
import cfgrib

dsx = xr.open_dataset('acpcp_sfc_2000010100_c00.grib2', engine='cfgrib')
dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')

In both cases (which are essentially the same thing) the dimensions are recognized correctly; in particular:

dsx [screenshot of the xarray Dataset repr]

dscf

Dataset(
dimensions={'step': 80, 'latitude': 721, 'longitude': 1440}, 
variables={'number': Variable(dimensions=(), data=0), 
'time': Variable(dimensions=(), data=946684800), 
'step': Variable(dimensions=('step',), data=array([  3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.,  30.,  33.,
                    ...condensed..for...clarity    
       201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
       234., 237., 240.])), 
'surface': Variable(dimensions=(), data=0.0),
'latitude': Variable(dimensions=('latitude',), data=array([ 90.  ,  89.75,  89.5 ,  89.25,  89.  ,  88.75,  88.5 ,  88.25,
        88.  ,  87.75,  87.5 ,  87.25,  87.  ,  86.75,  86.5 ,  86.25,
                   ...condensed..for...clarity    
       -88.  , -88.25, -88.5 , -88.75, -89.  , -89.25, -89.5 , -89.75,
       -90.  ])), 
'longitude': Variable(dimensions=('longitude',), data=array([0.0000e+00, 2.5000e-01, 5.0000e-01, ..., 3.5925e+02, 3.5950e+02,
       3.5975e+02])), 'valid_time': Variable(dimensions=('step',), data=array([9.466956e+08, 9.467064e+08, 9.467172e+08, 9.467280e+08,
                  ...condensed..for...clarity    
       9.475164e+08, 9.475272e+08, 9.475380e+08, 9.475488e+08])), 
'acpcp': Variable(dimensions=('step', 'latitude', 'longitude'), 
data=OnDiskArray(index=FileIndex(fieldset=FileStream(path='acpcp_sfc_2000010100_c00.grib2', errors='warn'), index_keys=['centre', 'centreDescription', 'dataDate', 'dataTime', 'dataType', 'directionNumber', 'edition', 'endStep', 'frequencyNumber', 'gridType', 'level:float', 'number', 'numberOfPoints', 'paramId', 'step', 'stepType', 'stepUnits', 'subCentre', 'time', 'typeOfLevel'], filter_by_keys={'paramId': 3063}, computed_keys={}, index_protocol_version='2'), shape=(80, 721, 1440), missing_value=9999))}, attributes={'GRIB_edition': 2, 'GRIB_centre': 'kwbc', 'GRIB_centreDescription': 'US National Weather Service - NCEP', 'GRIB_subCentre': 2, 'Conventions': 'CF-1.7', 'institution': 'US National Weather Service - NCEP', 'history': '2022-04-14T18:50 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {"source": "acpcp_sfc_2000010100_c00.grib2", "filter_by_keys": {}, "encode_cf": ["parameter", "time", "geography", "vertical"]}'}, encoding={'source': 'acpcp_sfc_2000010100_c00.grib2', 'filter_by_keys': {}, 'encode_cf': ('parameter', 'time', 'geography', 'vertical')})

Note: step is a dimension, just like latitude and longitude. time is special: it has no dimensions and a single value. step, just like latitude and longitude, has values in all of its fields; in particular, I see no difference between:

'step': Variable(dimensions=('step',), data=array([  3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.,  30.,  33.,
        ...condensed..for...clarity
       201., 204., 207., 210., 213., 216., 219., 222., 225., 228., 231.,
       234., 237., 240.])), 

and

'latitude': Variable(dimensions=('latitude',), data=array([ 90.  ,  89.75,  89.5 ,  89.25,  89.  ,  88.75,  88.5 ,  88.25,
        88.  ,  87.75,  87.5 ,  87.25,  87.  ,  86.75,  86.5 ,  86.25,
            ...condensed..for...clarity    
       -88.  , -88.25, -88.5 , -88.75, -89.  , -89.25, -89.5 , -89.75,
       -90.  ])), 
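
As a sanity check, the same sizes can also be confirmed programmatically; a small sketch using the dsx object opened above:

print(dsx.sizes)               # step: 80, latitude: 721, longitude: 1440
print(dsx['step'].values[:3])  # the first three forecast steps (3 h, 6 h, 9 h)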

however, when I do:

!pip install kerchunk
import warnings
warnings.simplefilter("ignore")  # filter some warning messages
import xarray as xr
import hvplot.xarray
import datetime as dt
import pandas as pd
import dask
import panel as pn
import json
import fsspec
from kerchunk.grib2 import scan_grib
from kerchunk.combine import MultiZarrToZarr

and

so = {"anon": True, "default_cache_type": "readahead"}
outscan = scan_grib('acpcp_sfc_2000010100_c00.grib2',
                common_vars = ['time', 'step', 'latitude', 'longitude', 'valid_time'], storage_options=so )

it takes a second, and outscan looks like:

{'version': 1,
 'refs': {'.zgroup': '{\n    "zarr_format": 2\n}',
  '.zattrs': '{\n    "Conventions": "CF-1.7",\n    "GRIB_centre": "kwbc",\n    "GRIB_centreDescription": "US National Weather Service - NCEP",\n    "GRIB_edition": 2,\n    "GRIB_subCentre": 2,\n    "history": "2022-04-14T19:03 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpq40ghn6agrib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n    "institution": "US National Weather Service - NCEP"\n}',
  'time/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<i8",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'time/0': 'base64:gENtOAAAAAA=',
  'time/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "calendar": "proleptic_gregorian",\n    "long_name": "initial time of forecast",\n    "standard_name": "forecast_reference_time",\n    "units": "seconds since 1970-01-01T00:00:00"\n}',
  'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',
  'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',
  'longitude/.zarray': '{\n    "chunks": [\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "longitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        1440\n    ],\n    "zarr_format": 2\n}',
  'longitude/0': ['{{u}}', 0, 325317],
  'longitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "longitude"\n    ],\n    "long_name": "longitude",\n    "standard_name": "longitude",\n    "units": "degrees_east"\n}',
  'valid_time/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'valid_time/0': 'base64:AAAA2LY2zEE=',
  'valid_time/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "calendar": "proleptic_gregorian",\n    "long_name": "time",\n    "standard_name": "time",\n    "units": "seconds since 1970-01-01T00:00:00"\n}',
  'acpcp/.zarray': '{\n    "chunks": [\n        721,\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f4",\n    "fill_value": 9999.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "acpcp"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721,\n        1440\n    ],\n    "zarr_format": 2\n}',
  'acpcp/0.0': ['{{u}}', 29740164, 413329],
  'acpcp/.zattrs': '{\n    "GRIB_NV": 0,\n    "GRIB_Nx": 1440,\n    "GRIB_Ny": 721,\n    "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n    "GRIB_cfVarName": "acpcp",\n    "GRIB_dataType": "cf",\n    "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n    "GRIB_gridType": "regular_ll",\n    "GRIB_iDirectionIncrementInDegrees": 0.25,\n    "GRIB_iScansNegatively": 0,\n    "GRIB_jDirectionIncrementInDegrees": 0.25,\n    "GRIB_jPointsAreConsecutive": 0,\n    "GRIB_jScansPositively": 0,\n    "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n    "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n    "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n    "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n    "GRIB_missingValue": 9999,\n    "GRIB_name": "Convective precipitation (water)",\n    "GRIB_numberOfPoints": 1038240,\n    "GRIB_paramId": 3063,\n    "GRIB_shortName": "acpcp",\n    "GRIB_stepType": "accum",\n    "GRIB_stepUnits": 1,\n    "GRIB_totalNumber": 10,\n    "GRIB_typeOfLevel": "surface",\n    "GRIB_units": "kg m**-2",\n    "_ARRAY_DIMENSIONS": [\n        "latitude",\n        "longitude"\n    ],\n    "coordinates": "number time step surface latitude longitude valid_time",\n    "long_name": "Convective precipitation (water)",\n    "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n    "units": "kg m**-2"\n}'},
 'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}

Note the part regarding step:

 'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',

What you notice is that these are all empty (zero-dimensional):

  • "chunks": [],
  • "shape": [],
  • "filters": null,
  • 'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  • "_ARRAY_DIMENSIONS": []

By contrast, for latitude or longitude:

 'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',

you have:

  • "chunks": [721],
  • "filters": [{"id": "grib", "var": "latitude"}],
  • "shape": [721],
  • 'latitude/0': ['{{u}}', 0, 325317],
  • "_ARRAY_DIMENSIONS": ["latitude"]
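
(To double-check this programmatically against the outscan dictionary above: the .zarray entries are plain JSON strings, and the inline 'step/0' value is just the raw little-endian float64 bytes of a single scalar. This is only a verification sketch, not part of kerchunk.)

import json, struct

step_meta = json.loads(outscan['refs']['step/.zarray'])
lat_meta = json.loads(outscan['refs']['latitude/.zarray'])
print(step_meta['shape'], step_meta['chunks'])   # [] []       -> zero-dimensional
print(lat_meta['shape'], lat_meta['chunks'])     # [721] [721] -> a real 1-D array

# the inline bytes stored under 'step/0' decode to the first message's step only
print(struct.unpack('<d', b'\x00\x00\x00\x00\x00\x00\x08@')[0])  # 3.0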

OK, so I went and looked into grib2.py and isolated the part where the variables defined as common_vars are read and stored (link):

if common is False:
    # done for first valid message
    logger.debug("Common variables")
    z.attrs.update(ds.attributes)
    for var in common_vars:
        # assume grid, etc is the same across all messages
        attr = ds.variables[var].attributes or {}
        attr['_ARRAY_DIMENSIONS'] = ds.variables[var].dimensions
        _store_array(store, z, ds.variables[var].data, var, inline_threashold, offset, size,
                     attr)

If I simply do as below (for simplicity I limit common_vars to two, step and latitude):

dscf = cfgrib.open_file('acpcp_sfc_2000010100_c00.grib2')
for var in ['step', 'latitude']:
    # assume grid, etc. is the same across all messages
    print(var)
    print(dscf.variables[var].attributes)
    print(dscf.variables[var].dimensions)
    print(dscf.variables[var].data)

I get:

step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
('step',)
[  3.   6.   9.  12.  15.  18.  21.  24.  27.  30.  33.  36.  39.  42.
        ...condensed..for...clarity
 213. 216. 219. 222. 225. 228. 231. 234. 237. 240.]
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]

So the problem is NOT the grib2 file (which oftentimes can be the culprit). If I copy and paste from grib2.py the relevant code for _split_file

import tempfile
def _split_file(f, skip=0):
    if hasattr(f, "size"):
        size = f.size
    else:
        f.seek(0, 2)
        size = f.tell()
        f.seek(0)
    part = 0

    while f.tell() < size:
        # logger.debug(f"extract part {part + 1}")
        start = f.tell()
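        # GRIB2 keeps the total message length in octets 9-16 of section 0;
        # reading 4 bytes at offset 12 takes the low 32 bits of that length,
        # which is enough for messages smaller than 4 GiB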
        f.seek(12, 1)
        part_size = int.from_bytes(f.read(4), "big")
        f.seek(start)
        data = f.read(part_size)
        assert data[:4] == b"GRIB"
        assert data[-4:] == b"7777"
        fn = tempfile.mktemp(suffix="grib2")
        with open(fn, "wb") as fo:
            fo.write(data)
        yield fn, start, part_size
        part += 1
        if skip and part > skip:
            break

and run:

import cfgrib
import fsspec   # imported again here so this block runs on its own

url = 'acpcp_sfc_2000010100_c00.grib2'
with fsspec.open(url, "rb") as f:
    for fn, offset, size in _split_file(f, skip=0):
        dscf = cfgrib.open_file(fn)
        for var in ['step', 'latitude']:
            # assume grid, etc. is the same across all messages
            print(var)
            print(dscf.variables[var].attributes)
            print(dscf.variables[var].dimensions)
            print(dscf.variables[var].data)

What I get is the same as in outscan above, BUT I get one printout for each value of step:

step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
3.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
6.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -90.  ]
step
{'long_name': 'time since forecast_reference_time', 'units': 'hours', 'standard_name': 'forecast_period'}
()
9.0
latitude
{'units': 'degrees_north', 'standard_name': 'latitude', 'long_name': 'latitude', 'stored_direction': 'decreasing'}
('latitude',)
[ 90.    89.75  89.5   89.25  89.    88.75  88.5   88.25  88.    87.75
        ...condensed..for...clarity
 -87.5  -87.75 -88.   -88.25 -88.5  -88.75 -89.   -89.25 -89.5  -89.75
 -90.  ]

However, once again, when I run scan_grib (let’s use only step and latitude as common vars):

so = {"anon": True, "default_cache_type": "readahead"}
outscansl = scan_grib('acpcp_sfc_2000010100_c00.grib2',
                common_vars = [ 'step', 'latitude', ], storage_options=so )

it takes a while (almost as if it is indeed going through all 80 values of step), but the output contains only the first one:

outscansl

{'version': 1,
 'refs': {'.zgroup': '{\n    "zarr_format": 2\n}',
  '.zattrs': '{\n    "Conventions": "CF-1.7",\n    "GRIB_centre": "kwbc",\n    "GRIB_centreDescription": "US National Weather Service - NCEP",\n    "GRIB_edition": 2,\n    "GRIB_subCentre": 2,\n    "history": "2022-04-14T19:23 GRIB to CDM+CF via cfgrib-0.9.10.0/ecCodes-2.25.0 with {\\"source\\": \\"../../tmp/tmpyycaxsp5grib2\\", \\"filter_by_keys\\": {}, \\"encode_cf\\": [\\"parameter\\", \\"time\\", \\"geography\\", \\"vertical\\"]}",\n    "institution": "US National Weather Service - NCEP"\n}',

  'step/.zarray': '{\n    "chunks": [],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": null,\n    "order": "C",\n    "shape": [],\n    "zarr_format": 2\n}',
  'step/0': '\x00\x00\x00\x00\x00\x00\x08@',
  'step/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [],\n    "long_name": "time since forecast_reference_time",\n    "standard_name": "forecast_period",\n    "units": "hours"\n}',

  'latitude/.zarray': '{\n    "chunks": [\n        721\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "latitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721\n    ],\n    "zarr_format": 2\n}',
  'latitude/0': ['{{u}}', 0, 325317],
  'latitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "latitude"\n    ],\n    "long_name": "latitude",\n    "standard_name": "latitude",\n    "stored_direction": "decreasing",\n    "units": "degrees_north"\n}',
  'longitude/.zarray': '{\n    "chunks": [\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f8",\n    "fill_value": 0.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "longitude"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        1440\n    ],\n    "zarr_format": 2\n}',

  'longitude/0': ['{{u}}', 29740164, 413329],
  'longitude/.zattrs': '{\n    "_ARRAY_DIMENSIONS": [\n        "longitude"\n    ],\n    "long_name": "longitude",\n    "standard_name": "longitude",\n    "units": "degrees_east"\n}',
  'acpcp/.zarray': '{\n    "chunks": [\n        721,\n        1440\n    ],\n    "compressor": null,\n    "dtype": "<f4",\n    "fill_value": 9999.0,\n    "filters": [\n        {\n            "id": "grib",\n            "var": "acpcp"\n        }\n    ],\n    "order": "C",\n    "shape": [\n        721,\n        1440\n    ],\n    "zarr_format": 2\n}',
  'acpcp/0.0': ['{{u}}', 29740164, 413329],
  'acpcp/.zattrs': '{\n    "GRIB_NV": 0,\n    "GRIB_Nx": 1440,\n    "GRIB_Ny": 721,\n    "GRIB_cfName": "lwe_thickness_of_convective_precipitation_amount",\n    "GRIB_cfVarName": "acpcp",\n    "GRIB_dataType": "cf",\n    "GRIB_gridDefinitionDescription": "Latitude/longitude. Also called equidistant cylindrical, or Plate Carree",\n    "GRIB_gridType": "regular_ll",\n    "GRIB_iDirectionIncrementInDegrees": 0.25,\n    "GRIB_iScansNegatively": 0,\n    "GRIB_jDirectionIncrementInDegrees": 0.25,\n    "GRIB_jPointsAreConsecutive": 0,\n    "GRIB_jScansPositively": 0,\n    "GRIB_latitudeOfFirstGridPointInDegrees": 90.0,\n    "GRIB_latitudeOfLastGridPointInDegrees": -90.0,\n    "GRIB_longitudeOfFirstGridPointInDegrees": 0.0,\n    "GRIB_longitudeOfLastGridPointInDegrees": 359.75,\n    "GRIB_missingValue": 9999,\n    "GRIB_name": "Convective precipitation (water)",\n    "GRIB_numberOfPoints": 1038240,\n    "GRIB_paramId": 3063,\n    "GRIB_shortName": "acpcp",\n    "GRIB_stepType": "accum",\n    "GRIB_stepUnits": 1,\n    "GRIB_totalNumber": 10,\n    "GRIB_typeOfLevel": "surface",\n    "GRIB_units": "kg m**-2",\n    "_ARRAY_DIMENSIONS": [\n        "latitude",\n        "longitude"\n    ],\n    "coordinates": "number time step surface latitude longitude valid_time",\n    "long_name": "Convective precipitation (water)",\n    "standard_name": "lwe_thickness_of_convective_precipitation_amount",\n    "units": "kg m**-2"\n}'},
 'templates': {'u': 'acpcp_sfc_2000010100_c00.grib2'}}

So I thought that the issue was _store_array overwriting things, but if I copy and paste the whole grib2.py file and add some print statements, it seems like _store_array prints only the first step value (which makes sense, since that is what I have in the outscan variable), even though it is inside the for var in common_vars: loop 🤷🏼‍♀️

Bottom line, I have no idea where the issue is. Clearly, _split_file generates one “message” per step (why? why not per latitude? the order of common_vars doesn’t matter, I checked), but then those parts get lost somehow.
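
For what it’s worth, here is a minimal sketch of what seems to be happening, based only on the outputs above and on the if common is False: guard in the grib2.py excerpt (this is illustrative pseudo-logic, not kerchunk code): each piece produced by _split_file is a standalone one-message GRIB file, so cfgrib reports its step as a scalar, and the common variables appear to be written only for the first message.

refs = {}
common = False
for scalar_step in [3.0, 6.0, 9.0]:      # stand-ins for the 80 single-message pieces
    if common is False:
        # per-message dataset: step has no dimensions, so shape/chunks come out empty
        refs['step/.zarray'] = '{"chunks": [], "shape": []}'
        refs['step/0'] = scalar_step
        common = True                    # later messages never update the common vars
print(refs['step/0'])                    # 3.0 -- only the first step survives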

Sorry for the long issue, but I wanted to be as detailed as possible. Thanks for your time, Chiara

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
TomAugspurger commented, Apr 14, 2022

This kinda works, maybe…

A bunch of hacky code to parse the .idx file

!pip install -q git+https://github.com/TomAugspurger/cogrib

import adlfs
import cogrib
import urllib.request
import cfgrib
import requests
import re
import itertools

def pairwise(iterable):
    # pairwise('ABCDEFG') --> AB BC CD DE EF FG
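    # (same behaviour as itertools.pairwise, available in Python >= 3.10)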
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

# XXX: No idea why this is necessary,
# but the inventory file uses different names than cfgrib / xarray.
PARAM_VARAIBLE_MAPPING = {
    "tcdc": "tcc",
    "tmp": "t",
    "soilm": "ssw",
}
# 8:338750:d=2022022100:TMP:0-0.1 m below ground:anl:
def parse_index(index_text: str, length):
    lines = index_text.split("\n")
    names = ["row", "_offset", "date", "variable", "type_of_level", "thing"]
    records = [
        dict(zip(names, line.split(":")[:-1]))
        for line in lines
        if line
    ]

    range_xpr = re.compile(r"(\d*\.?\d*)-(\d*\.?\d*)")

    for record in records:
        record["_offset"] = int(record["_offset"])
        record["param"] = record.pop("variable").lower()  # is this right?
        # XXX: why does this name differ?
        if record["param"] in PARAM_VARAIBLE_MAPPING:
            record["param"] = PARAM_VARAIBLE_MAPPING[record["param"]]
            
        m = range_xpr.match(record["type_of_level"])
        if m:
            endpoints = list(map(float, m.groups()))
            record["levelist"] = endpoints[0]
    else:
        record["levelist"] = None  # the original set a bare local variable here; storing it on the record looks like the intent
        start, stop = record["thing"].split(" ")[0].split("-")
        record["step"] = stop
            
    for a, b in pairwise(records):
        a["_length"] = b["_offset"] - a["_offset"]
    records[-1]["_length"] = length - records[-1]["_offset"]
    return records
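
For illustration, here is how the parser behaves on a made-up two-line snippet laid out like the .idx file (the offsets and total length below are hypothetical, purely to show which fields come out):

sample = (
    "1:0:d=2000010100:ACPCP:surface:0-3 hour acc fcst:\n"
    "2:325317:d=2000010100:ACPCP:surface:3-6 hour acc fcst:\n"
)
for rec in parse_index(sample, length=650634):
    print(rec["param"], rec["step"], rec["_offset"], rec["_length"])
# acpcp 3 0 325317
# acpcp 6 325317 325317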

Create the references

import urllib.request
import requests


url = 'https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2000/2000010100/c00/Days:1-10/acpcp_sfc_2000010100_c00.grib2'
filename, message = urllib.request.urlretrieve(url)
r = requests.get(url + ".idx")
indices = parse_index(r.text, int(message["Content-Length"]))

dss = cfgrib.open_datasets(filename, backend_kwargs=dict(indexpath=""))
references = cogrib.make_references(dss[0], indices, url)

Use them with a reference file system

import fsspec
import xarray as xr

ds = xr.open_dataset(
    fsspec.get_mapper("reference://", fo=references),
    engine="zarr",
    chunks={}
)
ds

which outputs

<xarray.Dataset>
Dimensions:     (step: 80, latitude: 721, longitude: 1440)
Coordinates:
  * latitude    (latitude) float64 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude   (longitude) float64 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
    number      int64 ...
  * step        (step) timedelta64[ns] 0 days 03:00:00 ... 10 days 00:00:00
    surface     float64 ...
    time        datetime64[ns] ...
    valid_time  (step) datetime64[ns] dask.array<chunksize=(80,), meta=np.ndarray>
Data variables:
    acpcp       (step, latitude, longitude) float32 dask.array<chunksize=(1, 721, 1440), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7
    GRIB_centre:             kwbc
    GRIB_centreDescription:  US National Weather Service - NCEP
    GRIB_edition:            2
    GRIB_subCentre:          2
    institution:             US National Weather Service - NCEP

Note that this doesn’t use kerchunk.grib2.scan_grib, but it should produce an fsspec-compatible reference output.

0 reactions
chiaral commented, Apr 14, 2022

Thank you both!

I am not surprised that this GRIB needs different handling compared to standard cfgrib usage. Unfortunately the NOAA-generated GRIB files can be very, very different from the ECMWF ones. It’s one of the biggest issues with GRIB.

I will take a look at cogrib! thanks!
