We need a fast path for open_mfdataset

It would be great to have a “fast path” option for open_mfdataset, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape.
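The idea can be sketched on in-memory datasets. The following is only an illustration of the proposal, not xarray's implementation: the helper name `fast_path_concat` is hypothetical, and it assumes that the coordinate along the concatenation dimension still has to be joined from every dataset, while all other coordinates are trusted from the first one. Coordinates that span the concatenation dimension without being its index are ignored in this sketch.

```python
import numpy as np
import xarray as xr


def fast_path_concat(datasets, dim):
    """Concatenate along `dim`, trusting the first dataset's coordinates.

    Hypothetical helper: only the shapes of the data variables are checked.
    """
    first = datasets[0]
    merged = {}
    for name, var in first.data_vars.items():
        if dim not in var.dims:
            # Variables without the concat dimension come from the first dataset.
            merged[name] = var.variable
            continue
        axis = var.dims.index(dim)
        expected = tuple(s for i, s in enumerate(var.shape) if i != axis)
        pieces = []
        for ds in datasets:
            other = ds[name]
            shape = tuple(s for i, s in enumerate(other.shape) if i != axis)
            if other.dims != var.dims or shape != expected:
                raise ValueError(f"variable {name!r} has an unexpected shape")
            pieces.append(other.data)
        merged[name] = (var.dims, np.concatenate(pieces, axis=axis), var.attrs)

    # Coordinates that do not involve `dim` are copied from the first dataset
    # without comparing them against the other datasets.
    coords = {k: v.variable for k, v in first.coords.items() if dim not in v.dims}
    if dim in first.coords:
        # Assumption: the concat-dimension coordinate is joined from all datasets.
        coords[dim] = (dim, np.concatenate([ds[dim].values for ds in datasets]))
    return xr.Dataset(merged, coords=coords, attrs=first.attrs)


# Two datasets whose "x" coordinates differ; the second one is never inspected.
a = xr.Dataset({"t": (("time", "x"), np.zeros((2, 3)))},
               coords={"time": [0, 1], "x": [10, 20, 30]})
b = xr.Dataset({"t": (("time", "x"), np.ones((2, 3)))},
               coords={"time": [2, 3], "x": [11, 21, 31]})
combined = fast_path_concat([a, b], "time")
```

Because the "x" coordinate of `b` is never read, the result silently carries the values from `a`. That is the trade-off the proposal accepts in exchange for speed.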

Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray.

This is also related to #1385.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 9
  • Comments: 19 (19 by maintainers)

Top GitHub Comments

3 reactions
dcherian commented, Sep 16, 2019

PS @rabernat

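%%time
import xarray as xr
# compat='override' skips the comparison and takes non-concatenated variables from the first file
ds = xr.open_mfdataset("/glade/p/cesm/community/ASD-HIGH-RES-CESM1/hybrid_v5_rel04_BC5_ne120_t12_pop62/ocn/proc/tseries/monthly/*.nc",
                       parallel=True, coords="minimal", data_vars="minimal", compat='override')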

This completes in 40 seconds with 10 workers on Cheyenne.

3 reactions
jhamman commented, Jan 12, 2018

@rabernat - Depending on the structure of the dataset, another possibility that would substantially speed up some open_mfdataset tasks is to open each file and read its metadata in parallel (dask, joblib, etc.), returning either just the dataset schema or a picklable version of the dataset itself. I think this will only work with autoclose=True, but it could be quite useful when working with many files.
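A minimal sketch of that pattern, using only the standard library: the names `open_many` and `read_schema` are hypothetical, and `read_schema` is a stand-in that returns a fake schema instead of touching any file. Whether a real opener can safely run in threads depends on the backend, so treat this as an outline of the idea rather than a recommended implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def open_many(paths, opener, max_workers=8):
    """Apply `opener` to every path concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `paths`,
        # regardless of which call finishes first.
        return list(pool.map(opener, paths))


def read_schema(path):
    # Hypothetical stand-in for "open the file and return its metadata".
    return {"path": path, "dims": {"time": 12}}


schemas = open_many(["a.nc", "b.nc", "c.nc"], read_schema, max_workers=2)
```

In practice the opener would be something like `xr.open_dataset`, and the collected results would then be combined in a separate step.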
