Regridding API design
I was looking at Pangeo’s Regridding Design Document, which suggests an interface like `da.remap.remap_like(da_target, how='bilinear')`. It looks clean but doesn’t match the “two-step” procedure used by most regridding algorithms.
Regridding weight calculation and weight application should be done separately, as reviewed in #2. The weights depend only on the source and target grids, not on the input data. As long as the grids do not change, users only need to calculate the weights once and can then apply them to any data. Applying weights is often orders of magnitude faster than calculating them (see the timing in #6 as an example), so separating the two steps has a huge impact on performance, especially when the task is to regrid a lot of data between a fixed pair of grids.
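To make the cost asymmetry concrete: once computed, regridding is just a sparse matrix-vector product over the flattened grids. A minimal sketch (the grid sizes and density here are invented for illustration, not real weights):

```python
# Illustration only: weight application as a sparse matrix-vector product.
import numpy as np
import scipy.sparse as sps

n_in, n_out = 180 * 360, 90 * 180       # flattened source / target grid sizes
# Computing the weight matrix is the expensive, grid-dependent step...
weights = sps.random(n_out, n_in, density=1e-5, format='csr')

# ...while applying it to each new field is a fast matvec
# that can be repeated for any number of input fields.
data_in = np.random.rand(n_in)
data_out = weights @ data_in
```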
xESMF’s current design, `dr_out = xe.regrid(ds_in, ds_out, dr_in)`, basically re-computes the weights every time. Using two steps should boost the performance by 10~100x.
Thus I am thinking about an sklearn-like API, which is also two-step:
- In sklearn you train a model by `model.fit(x_train, y_train)`
- then make new predictions by `y_pred = model.predict(x_test)`
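A self-contained toy version of that fit/predict pattern (the model and data here are arbitrary placeholders):

```python
# Toy illustration of sklearn's two-step pattern.
import numpy as np
from sklearn.linear_model import LinearRegression

x_train = np.arange(10.0).reshape(-1, 1)
y_train = 2.0 * x_train.ravel() + 1.0

model = LinearRegression()
model.fit(x_train, y_train)         # step 1: the expensive part, done once

x_test = np.array([[12.0], [15.0]])
y_pred = model.predict(x_test)      # step 2: cheap, repeatable application
```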
xESMF can do similar things:
- Calculate regridding weights by `weights = xe.compute_weights(ds_in, ds_out, method='bilinear')`, where `ds_in` and `ds_out` are xarray Datasets containing input and output grid information.
- then apply weights to data by `dr_out = weights.apply(dr_in)`, where `dr_in` is the input DataArray.
- Because ESMPy writes the weights to a file, the next time you can read them from file instead of computing them again: `weights = xe.read_weights("weights.nc")` (sketched below). The IO time is negligible compared to re-computing the weights.
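On the third point: ESMF-generated weight files use a standard sparse-triplet layout (variables `col`, `row`, and `S`, with 1-based indices), so `xe.read_weights` could be little more than the following sketch; the function name and details are hypothetical, not settled API:

```python
# Sketch of reading an ESMF-generated weight file into a sparse matrix.
# Assumes the ESMF/SCRIP layout with variables 'col', 'row', 'S';
# variable names may differ depending on how the file was written.
import scipy.sparse as sps
import xarray as xr

def read_weights(filename, n_in, n_out):
    ds = xr.open_dataset(filename)
    matrix = sps.coo_matrix(
        (ds['S'].values,
         (ds['row'].values - 1, ds['col'].values - 1)),  # 1-based -> 0-based
        shape=(n_out, n_in),
    ).tocsr()
    # In the real API this matrix would be wrapped in the weights class below.
    return matrix

# e.g. matrix = read_weights("weights.nc", n_in=180*360, n_out=90*180)
```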
Here `weights` is a tiny class that holds the weights and knows how to apply them to data. Alternatively, `weights` can simply be an xarray Dataset read by `raw_weights = xr.open_dataset("weights.nc")`. Then step 2 changes to `dr_out = xe.apply_weights(raw_weights, dr_in)`. I prefer the first approach because it feels more like sklearn and people might find it more familiar.
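To make the “tiny class” option concrete, here is a minimal hypothetical sketch, assuming the weights live in a scipy sparse matrix (none of these names are settled API):

```python
# Hypothetical sketch of the tiny weights class; names are not final.
import numpy as np
import scipy.sparse as sps
import xarray as xr

class Weights:
    """Holds a sparse weight matrix and knows how to apply it to data."""

    def __init__(self, matrix, out_shape):
        self.matrix = matrix        # scipy.sparse, shape (n_out, n_in)
        self.out_shape = out_shape  # e.g. (n_lat_out, n_lon_out)

    def apply(self, dr_in):
        # Flatten the horizontal dimensions, do the sparse matvec,
        # and reshape back onto the target grid.
        data_out = self.matrix @ np.asarray(dr_in).ravel()
        return xr.DataArray(data_out.reshape(self.out_shape),
                            dims=('lat', 'lon'))

# Quick demo with synthetic data and random (unnormalized) weights:
matrix = sps.random(10 * 20, 20 * 40, density=0.001, format='csr')
weights = Weights(matrix, out_shape=(10, 20))
dr_out = weights.apply(xr.DataArray(np.ones((20, 40)), dims=('lat', 'lon')))
```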
Any comments are welcome; otherwise I’ll proceed this way. @rabernat @jhamman @spencerahill. Please also @ anyone who is interested in regridding with xarray.
Top GitHub Comments
I think having a step that can compute and cache the weights is a great idea, and I very much like the sklearn-style approach. There are a few standard formats that I’m aware of for caching weights - in particular, SCRIP uses a format that both NCL and CDO are in turn able to read.
Now, if everything is done in the framework of xgcm, and there’s some notion of “standard” grids for a set of known models, then it may be possible to create a data server/archive which caches the weights for well-known re-gridding operations. Think about how Cartopy has the nifty utility to grab shapefiles from Natural Earth… what if, as part of pangeo-data, there were a bucket on EC2 or GCP that catalogued and archived these re-gridding weights and could be directly downloaded by a client? The whole catalogue could be entirely automated; it could even spin up a VM on AWS or GCP to create the weights whenever a request arrives for an unknown re-gridding operation.
On this topic, we are discussing how best to implement cell distance / area / volume data (generically called “grid metrics”) in xgcm/xgcm#81. One possibility is that xgcm will take care of those geometric questions. In that case, a regridding package could map between xgcm grids, rather than xarray datasets.