
handle files with different scale_factor and add_offset

It would be nice to handle netcdf files with different scale_factor and add_offset.

I recently used the ECMWF API to generate a bunch of netcdf3 files (in parallel, using dask!) but unfortunately the generated files all have different scale_factor and add_offset.

Here are four files that we would like to virtually aggregate with kerchunk:

14:30 $ aws s3 ls s3://rsignellbucket1/era5_land/ --endpoint https://mghp.osn.xsede.org --no-sign-request
2022-08-18 14:28:56  192215564 conus_2019-12-01.nc
2022-08-18 14:28:58  192215560 conus_2019-12-15.nc
2022-08-18 14:28:58  192215560 conus_2019-12-29.nc
2022-08-18 14:28:59  192215560 conus_2020-01-12.nc
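
A minimal sketch of the aggregation being attempted, assuming kerchunk’s netCDF3 and combine helpers (the class names, options, and storage settings below are illustrative and may differ between kerchunk versions):

# Hedged sketch: build per-file references, then combine virtually along time.
from kerchunk.netCDF3 import NetCDF3ToZarr
from kerchunk.combine import MultiZarrToZarr

so = {"anon": True, "client_kwargs": {"endpoint_url": "https://mghp.osn.xsede.org"}}
urls = [
    f"s3://rsignellbucket1/era5_land/conus_{d}.nc"
    for d in ("2019-12-01", "2019-12-15", "2019-12-29", "2020-01-12")
]

singles = [NetCDF3ToZarr(u, storage_options=so).translate() for u in urls]
combined = MultiZarrToZarr(singles, concat_dims=["time"]).translate()

# The problem: each file carries its own scale_factor/add_offset, but the
# combined reference set keeps only one pair of attributes, so chunks from
# the other files are unpacked with the wrong parameters.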

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

1 reaction
mikejohnson51 commented, Aug 23, 2022

Hi all - @dblodgett-usgs asked me to consider this from an R perspective. For this type of task (lazy read, access, manipulation) I rely on terra and its GDAL bindings.

The TL;DR is that R doesn’t have good (any?) zarr support, but it does tackle the above challenges pretty well for NetCDF/COG/TIF/STAC/THREDDS-type data. I believe that GDAL handles the unpacking of the data, so the aggregation example below would apply to that case.

Below are some brief examples using the above datasets and the public NWM forcings. The workflow relies on prepending the vsis3 (or vsicurl) prefix to data URLs.

ERA5-land data

# load libraries
library(terra)
library(dplyr)

# Link to an s3 file, prepended with the GDAL vsis3 prefix
url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'

# Lazy Read
(d = rast(url))
#> Warning: [rast] unknown extent

At this stage we don’t have data, just structure.

With it we can look at layer names, dimensions and spatial attributes:

#> class       : SpatRaster 
#> dimensions  : 721, 1440, 744  (nrow, ncol, nlyr)
#> resolution  : 0.0006944444, 0.001386963  (x, y)
#> extent      : 0, 1, 0, 1  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source      : air_pressure_at_mean_sea_level.nc 
#> names       : air_p~vel_1, air_p~vel_2, air_p~vel_3, air_p~vel_4, air_p~vel_5, air_p~vel_6, ...

# Look at layer count and names
nlyr(d)
#> [1] 744

# Look at the "third" dimension
(head(names(d)))
#> [1] "air_pressure_at_mean_sea_level_1" "air_pressure_at_mean_sea_level_2"
#> [3] "air_pressure_at_mean_sea_level_3" "air_pressure_at_mean_sea_level_4"
#> [5] "air_pressure_at_mean_sea_level_5" "air_pressure_at_mean_sea_level_6"

# Look at spatial metadata
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""

Slicing through “time” by layer number or name

Data is only fetched on the call to plot

plot(d[[3]])

With no spatial information we can only subset on the unit grid.

Here, data is fetched only for the lower-left quadrant on the call to crop:

cr = crop(d[[50]], ext(0,.5,0,.5)) 

#> class       : SpatRaster 
#> dimensions  : 361, 720, 1  (nrow, ncol, nlyr)
#> resolution  : 0.0006944444, 0.001386963  (x, y)
#> extent      : 0, 0.5, 0, 0.5006935  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source      : memory 
#> name        : air_pressure_at_mean_sea_level_50 
#> min value   :                          96877.31 
#> max value   :                         103217.8

plot(cr)

We can add the spatial metadata if known:

If we do so, then we can subset across multiple dimensions, say for the state of California at time interval “40”:

ext(d) = c(-180,180,-90,90)
crs(d) = "EPSG:4326"

AOI = AOI::aoi_get(state = "CA") %>% 
  vect() %>% 
  project(crs(d))

plot(crop(d[[40]], AOI))
plot(AOI, add = TRUE)

Aggregating Files

Let’s say we have two files, for January and February of 2020, that we want to read as a collection. We can define those URLs and treat them in aggregate.

Remembering that the first of those had 744 layers, we will read them together and plot layers 744 and 745:

url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'
url2 = '/vsis3/era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc'

xx = rast(c(url, url2))
#> Warning: [rast] unknown extent
#> unknown extent

nlyr(xx)
#> [1] 1440

plot(xx[[744:745]])

Created on 2022-08-22 by the reprex package (v2.0.1)

NWM forcing example:

Start with a url endpoint and open a connection:

## URL
u = '/vsis3/noaa-nwm-retrospective-2-1-pds/forcing/1979/197902010000.LDASIN_DOMAIN1'

## Lazy Read
d = rast(u)
#> Warning: [rast] unknown extent

Explore layer slices and spatial properties

nlyr(d)
#> [1] 9
(names(d))
#> [1] "LQFRAC"   "LWDOWN"   "PSFC"     "Q2D"      "RAINRATE" "SWDOWN"   "T2D"     
#> [8] "U2D"      "V2D"
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""

Assign missing spatial metadata:

ext(d) = ext(
  -2303999.62876143,
  2304000.37123857,
  -1920000.70008381,
  1919999.29991619
)
crs(d) = 'PROJCS["Sphere_Lambert_Conformal_Conic",GEOGCS["GCS_Sphere",DATUM["D_Sphere",SPHEROID["Sphere",6370000.0,0.0]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["false_easting",0.0],PARAMETER["false_northing",0.0],PARAMETER["central_meridian",-97.0],PARAMETER["standard_parallel_1",30.0],PARAMETER["standard_parallel_2",60.0],PARAMETER["latitude_of_origin",40.000008],UNIT["Meter",1.0]];-35691800 -29075200 126180232.640845;-100000 10000;-100000 10000;0.001;0.001;0.001;IsHighPrecision'


# Plot (fetch data)
plot(d$LWDOWN)

That’s not right!

The data is stored in reverse (“topdown” in GDAL terms), so we can flip it and plot:

plot(flip(d$LWDOWN))

plot(flip(crop(d$RAINRATE, project(AOI, crs(d)))))

(Screenshot from 2022-08-22: the cropped RAINRATE plot over the California AOI)

So far, we have wrapped a lot of this logic in the package opendap.catalog (name changing soon) to facilitate the discovery of data files, the automation of metadata generation, and the subsetting/aggregating of files.

We have approached this by developing a catalog (https://mikejohnson51.github.io/climateR-catalogs/catalog.json) that stores the spatiotemporal and attribute metadata of discoverable assets and then helps overcome the oddities of URL definition, applying the right spatial metadata, implementing transformations (e.g. flip), and slicing based on space and time. So far it’s proven to be a very useful pattern for our team.
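
For reference, the catalog itself is a plain JSON document that can be fetched and inspected directly; a minimal, schema-agnostic sketch in Python (nothing below is part of the climateR/opendap.catalog API):

# Hedged sketch: fetch the catalog and look at what each asset records.
import json, urllib.request

CATALOG = "https://mikejohnson51.github.io/climateR-catalogs/catalog.json"
with urllib.request.urlopen(CATALOG) as f:
    catalog = json.load(f)

# The exact schema is not documented here, so just inspect its shape:
print(type(catalog))
if isinstance(catalog, list) and catalog:
    print(sorted(catalog[0].keys()))  # metadata fields stored per asset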

If anyone would like to talk about this kind of stuff I would love it!

Thanks,

Mike

Created on 2022-08-22 by the reprex package (v2.0.1)

0 reactions
martindurant commented, Oct 26, 2022

An alternative way to achieve this occurred to me. If we allow for preffs’s model of multiple references per key, read and concatenated on load, then we can add a literal value into the reference set:

{
   "variable/0.1.0": [b'\x00\x00\x80?\x00\x00\x00\x00', [url, offset, size]],
   ...
}

for scale 1, offset 0 as float32. The tricky part is that the concatenated buffer would normally be passed to a decompressor before the scale/offset codec - but not in the case of netCDF3 (which is uncompressed internally). So all we have to do in this case is make a modified scale/offset codec that takes its params from the buffer, something like what the JSON codec does.
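
A minimal sketch of such a codec for the uncompressed netCDF3 case, assuming the first 8 bytes of each chunk hold scale and offset as little-endian float32 (the class name and header layout are illustrative, not an existing kerchunk/numcodecs API):

# Hedged sketch: a numcodecs-style codec that reads per-chunk scale/offset
# from an 8-byte float32 header prepended via the multi-part reference above.
import numpy as np
from numcodecs.abc import Codec

class PerChunkScaleOffset(Codec):          # hypothetical codec, for illustration
    codec_id = "per_chunk_scale_offset"

    def __init__(self, dtype="<i2", astype="<f4"):
        self.dtype = dtype                 # packed (stored) integer dtype
        self.astype = astype               # unpacked float dtype

    def decode(self, buf, out=None):
        buf = memoryview(buf)
        scale, offset = np.frombuffer(buf[:8], dtype="<f4")   # e.g. 1.0, 0.0
        packed = np.frombuffer(buf[8:], dtype=self.dtype)
        unpacked = packed.astype(self.astype) * scale + offset
        if out is not None:
            out[:] = unpacked
            return out
        return unpacked

    def encode(self, buf):
        raise NotImplementedError("read-only sketch")

Registering it via numcodecs.register_codec and listing it in the variable’s filters would, in principle, let zarr apply it at read time.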

For the general case where there is compression, we would need to make a compression wrapper codec that extracts the values, decompresses the rest, and passes the values back as a tuple (or remakes a combined buffer).

@rabernat, this looks a lot like your idea of including extra per-chunk information in the storage-layer metadata.
