handle files with different scale_factor and add_offset
It would be nice to handle netcdf files with different `scale_factor` and `add_offset`.

I recently used the ECMWF API to generate a bunch of netcdf3 files (in parallel, using dask!) but unfortunately the generated files all have different `scale_factor` and `add_offset`.
Here are four files that we would like to virtually aggregate with kerchunk:
```
14:30 $ aws s3 ls s3://rsignellbucket1/era5_land/ --endpoint https://mghp.osn.xsede.org --no-sign-request
2022-08-18 14:28:56 192215564 conus_2019-12-01.nc
2022-08-18 14:28:58 192215560 conus_2019-12-15.nc
2022-08-18 14:28:58 192215560 conus_2019-12-29.nc
2022-08-18 14:28:59 192215560 conus_2020-01-12.nc
```
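A quick way to confirm the problem (a sketch, not from the original issue) is to read each file's variable attributes and compare the packing parameters. This assumes anonymous access to the OSN endpoint shown above, and uses `mask_and_scale=False` so that `scale_factor`/`add_offset` stay visible in `.attrs`:

```python
# Sketch: compare scale_factor / add_offset across the four netCDF3 files.
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
)
for path in sorted(fs.ls("rsignellbucket1/era5_land/")):
    with fs.open(path) as f:
        # netCDF3 files, so the scipy engine can read from a file-like object
        # (this pulls the whole file over the network; fine for a one-off check)
        ds = xr.open_dataset(f, engine="scipy", mask_and_scale=False)
        for name, var in ds.data_vars.items():
            print(path, name, var.attrs.get("scale_factor"), var.attrs.get("add_offset"))
```

Because the packing differs per file, a single set of `scale_factor`/`add_offset` in the combined kerchunk metadata would unpack chunks from the other files incorrectly - which is what the ideas below try to address.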
Top GitHub Comments
Hi all - @dblodgett-usgs asked me to consider this from an R perspective. For this type of task (lazy read, access, manipulation) I rely on `terra` and its GDAL bindings. The TL;DR is that R doesn't have good (any?) zarr support, but does tackle the above challenges pretty well for NetCDF/COG/TIF/STAC/THREDDS type data. I believe that GDAL handles the unpacking of the data, so the aggregation examples below would apply to that.
Below are some brief examples using the above datasets and the public NWM forcings. The workflow relies on prepending the vsis3 (or vsicurl) prefixes to the data URLs.
ERA5-land data
At this stage we don’t have data, just structure.
With it we can look at layer names, dimensions and spatial attributes:
Slicing through “time” by layer number or name
Data is only fetched on the call to `plot`.
With no spatial information we can only subset on the unit grid. Here data is only fetched for the lower-left quadrant on the call to `crop`.
We can add the spatial metadata if known:
If we do so, then we can subset across multiple dimensions - say, for the state of California at interval "40".
Aggregating Files
Let's say we have two files, for January and February of 2020, that we want to read as a collection. We can define those URLs and treat them in aggregate.
Remembering that the first of those had 744 layers above, we will read them together and plot layers 744-745:
Created on 2022-08-22 by the reprex package (v2.0.1)
NWM forcing example:
Start with a URL endpoint and open a connection:
Explore layer slices and spatial properties:
Assign missing spatial metadata:
That's not right!
The data is stored in reverse (topdown in GDAL), so we can flip it and plot:
So far, we have wrapped a lot of this logic in the package `opendap.catalog` (name changing soon) to facilitate the discovery of data files, the automation of metadata generation, and the subsetting/aggregating of files.

We have approached this by developing a catalog (https://mikejohnson51.github.io/climateR-catalogs/catalog.json) that stores the spatiotemporal and attribute metadata of discoverable assets and then helps overcome the oddities of URL definition, applying the right spatial metadata, implementing transformations (e.g. flip), and slicing based on space and time. So far it's proven to be a very useful pattern for our team.
If anyone would like to talk about this kind of stuff I would love it!
Thanks,
Mike
Created on 2022-08-22 by the reprex package (v2.0.1)
An alternative way to achieve this occurred to me. If we allow for preffs's model of multiple references per key, to be read and concatenated on load, then we can add any absolute value into the references set - for example, scale 1 and offset 0 as float32. The tricky part is that the concatenated buffer would normally be passed to a decompressor before the scale/offset codecs - but not in the case of netCDF3 (which is uncompressed internally). So all we have to do in this case is make a modified scale/offset codec that takes its params from the buffer, something like what the JSON codec does.
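A minimal sketch of what that could look like, assuming (my assumption, not an existing kerchunk/preffs API) that the per-chunk scale and offset are prepended to each uncompressed netCDF3 chunk as two little-endian float32 values, and following the numcodecs `Codec` interface:

```python
# Sketch only: class and parameter names here are hypothetical.
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes, ndarray_copy

# The inline value prepended to each chunk via the references set:
# scale=1, offset=0 as float32 (8 bytes total).
header = np.array([1.0, 0.0], dtype="<f4").tobytes()


class InlineScaleOffset(Codec):
    """Unpack a chunk whose first 8 bytes carry its own scale and offset.

    Assumed layout: [float32 scale][float32 offset][packed integer payload].
    """

    codec_id = "inline_scale_offset"

    def __init__(self, packed_dtype="<i2", unpacked_dtype="<f4"):
        self.packed_dtype = packed_dtype
        self.unpacked_dtype = unpacked_dtype

    def decode(self, buf, out=None):
        buf = ensure_bytes(buf)
        scale, offset = np.frombuffer(buf, dtype="<f4", count=2)
        packed = np.frombuffer(buf, dtype=self.packed_dtype, offset=8)
        # CF-style unpacking: unpacked = packed * scale_factor + add_offset
        unpacked = (packed * scale + offset).astype(self.unpacked_dtype)
        return ndarray_copy(unpacked, out)

    def encode(self, buf):
        raise NotImplementedError("sketch: decode-only")
```

Registering it (e.g. `numcodecs.register_codec(InlineScaleOffset)`) would then let it be named as the array's filter in the generated zarr metadata.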
For the general case where there is compression, we would need to make a compression wrapper codec that extracts the values, decompresses the rest, and passes the values back as a tuple (or remakes a combined buffer).
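One way to sketch the "remakes a combined buffer" variant (again with hypothetical names; `inner` is assumed to be the real compressor's numcodecs config):

```python
# Sketch only: strips the 8-byte scale/offset header, decompresses the rest,
# then re-prepends the header for a codec like the one above to consume.
import numcodecs
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes


class HeaderPreservingDecompressor(Codec):

    codec_id = "header_preserving_decompressor"

    def __init__(self, inner):
        self.inner = inner  # e.g. {"id": "zlib", "level": 1}

    def decode(self, buf, out=None):
        buf = ensure_bytes(buf)
        header, payload = buf[:8], buf[8:]
        decompressed = numcodecs.get_codec(self.inner).decode(payload)
        return header + ensure_bytes(decompressed)

    def encode(self, buf):
        raise NotImplementedError("sketch: decode-only")
```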
@rabernat, this looks a lot like your idea of including extra per-chunk information in the storage layer metadata.