We need a fast path for open_mfdataset

It would be great to have a “fast path” option for open_mfdataset, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape.
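A minimal sketch of what such a fast path could look like, using in-memory datasets as stand-ins for files. The helper name `fast_concat` and the toy `make_file` function are hypothetical and not part of xarray. The sketch assumes that passing `join="override"`, `coords="minimal"`, `data_vars="minimal"` and `compat="override"` to `xr.concat` is an acceptable way to reuse the first dataset's coordinates; the only explicit check is on data variable shapes.

```python
import numpy as np
import xarray as xr


def fast_concat(datasets, dim):
    """Hypothetical helper: concatenate along `dim`, trusting the first dataset.

    Only the shapes of the data variables (ignoring `dim`) are checked.
    """
    first = datasets[0]
    for index, ds in enumerate(datasets[1:], start=1):
        for name, var in first.data_vars.items():
            expected = {d: n for d, n in var.sizes.items() if d != dim}
            actual = {d: n for d, n in ds[name].sizes.items() if d != dim}
            if expected != actual:
                raise ValueError(
                    f"dataset {index}: variable {name!r} has sizes {actual}, "
                    f"expected {expected}"
                )
    return xr.concat(
        datasets,
        dim=dim,
        data_vars="minimal",  # only concatenate variables that contain `dim`
        coords="minimal",     # same rule for coordinates
        compat="override",    # take everything else from the first dataset
        join="override",      # reuse the first dataset's indexes, no alignment
    )


def make_file(t0, lat):
    """Toy stand-in for one netCDF file (illustrative data only)."""
    return xr.Dataset(
        {"temp": (("time", "lat"), np.full((2, len(lat)), float(t0)))},
        coords={"time": [t0, t0 + 1], "lat": lat},
    )


lat = [10.0, 20.0, 30.0]
combined = fast_concat([make_file(0, lat), make_file(2, lat)], dim="time")
print(combined["temp"].shape)
```

With real files, the list passed to `fast_concat` would come from opening each file lazily; how much time this saves depends on how expensive the skipped coordinate comparisons are for the dataset at hand.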

Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray.

This is also related to #1385.

Issue Analytics

State:
Created 6 years ago
Reactions: 9
Comments: 19 (19 by maintainers)

Top GitHub Comments

3 reactions

dcherian commented, Sep 16, 2019

PS @rabernat

%%time
ds = xr.open_mfdataset("/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc", 
                        parallel=True, coords="minimal", data_vars="minimal", compat='override')

This completes in 40 seconds with 10 workers on cheyenne.

3 reactions

jhamman commented, Jan 12, 2018

@rabernat - Depending on the structure of the dataset, another possibility that would speed up some open_mfdataset tasks substantially is to implement the step of opening each file and getting its metadata in some parallel way (dask/joblib/etc.), returning either just the dataset schema or a picklable version of the dataset itself. I think this will only work with autoclose=True, but it could be quite useful when working with many files.
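The idea of collecting per-file metadata in parallel can be sketched roughly as follows. This version swaps in a standard-library thread pool rather than dask or joblib, and every name in it (`read_schemas`, `check_schemas`, `fake_read_schema`, the file names) is hypothetical. The `read_schema` callable is an assumption: in practice it might open one file, record each variable's shape, and close the file again.

```python
from concurrent.futures import ThreadPoolExecutor


def read_schemas(paths, read_schema, max_workers=8):
    """Apply `read_schema` to every path concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_schema, paths))


def check_schemas(schemas):
    """Return the first schema, raising if any other schema differs from it."""
    reference = schemas[0]
    for index, schema in enumerate(schemas[1:], start=1):
        if schema != reference:
            raise ValueError(
                f"file {index} has schema {schema}, expected {reference}"
            )
    return reference


def fake_read_schema(path):
    # Stand-in for opening a real file and reading its metadata;
    # the variable name and shape here are made up for illustration.
    return {"temp": (12, 180, 360)}


paths = [f"file_{i}.nc" for i in range(4)]  # hypothetical file names
schemas = read_schemas(paths, fake_read_schema)
reference = check_schemas(schemas)
print(reference)
```

Returning a small plain dictionary per file keeps the result picklable, so the same structure would also work with process-based or distributed workers.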