We need a fast path for open_mfdataset
It would be great to have a "fast path" option for `open_mfdataset`, in which all alignment / coordinate checking is bypassed. This would be used in cases where the user knows that many netCDF files all share the same coordinates (e.g. model output, satellite records from the same product, etc.). The coordinates would just be taken from the first file, and only the data variables would be read from all subsequent files. The only checking would be that the data variables have the correct shape.
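As a sketch of the intended semantics, here is a hypothetical user-side helper (not xarray API; the name `open_mfdataset_fast` and its exact behavior are assumptions based on the description above): coordinates are taken from the first file, subsequent files contribute only data variables, and the only validation is a dims/size check.

```python
import xarray as xr

def open_mfdataset_fast(paths, concat_dim="time"):
    """Hypothetical fast path: trust the coordinates of the first file."""
    first = xr.open_dataset(paths[0])
    # Coordinates that do not vary along the concat dimension are assumed
    # to be identical in every file, so they are read once from file 0.
    shared = {name: coord for name, coord in first.coords.items()
              if concat_dim not in coord.dims}
    pieces = [first]
    for path in paths[1:]:
        ds = xr.open_dataset(path)
        # The only checking: data variables must have the expected
        # dimensions and sizes (no coordinate values are compared).
        for name, var in ds.data_vars.items():
            ok = (name in first and var.dims == first[name].dims and
                  all(var.sizes[d] == first.sizes[d]
                      for d in var.dims if d != concat_dim))
            if not ok:
                raise ValueError(f"{name!r} in {path} does not match "
                                 f"the shape found in {paths[0]}")
        # Replace the coords wholesale with the first file's, so that
        # concat/alignment never has to compare coordinate values.
        pieces.append(ds.drop_vars(list(shared), errors="ignore")
                        .assign_coords(shared))
    return xr.concat(pieces, dim=concat_dim)
```

Because every piece ends up carrying the very same coordinate objects, any alignment done inside `concat` should reduce to cheap identity checks rather than element-wise comparisons, which is where the savings would come from.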
Implementing this would require some refactoring. @jbusecke mentioned that he had developed a solution for this (related to #1704), so maybe he could be the one to add this feature to xarray.
This is also related to #1385.
PS @rabernat: This completes in 40 seconds with 10 workers on cheyenne.
@rabernat - Depending on the structure of the dataset, another possibility that would speed up some `open_mfdataset` tasks substantially is to implement the step of opening each file and getting its metadata in some parallel way (dask/joblib/etc.), and either returning just the dataset schema or a picklable version of the dataset itself. I think this will only be able to work with `autoclose=True`, but it could be quite useful when working with many files.
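A rough sketch of that idea, assuming the `autoclose` keyword available at the time (the `open_parallel` helper below is hypothetical, not xarray API): each file is opened inside a `dask.delayed` task so the metadata reads run concurrently, and the lazily-backed datasets are combined afterwards.

```python
import dask
import xarray as xr

@dask.delayed
def _open_one(path):
    # chunks={} keeps variables as lazy dask arrays, so only metadata is
    # read here; autoclose=True (per the comment above) releases the file
    # handle between accesses, which matters when opening many files.
    return xr.open_dataset(path, chunks={}, autoclose=True)

def open_parallel(paths, concat_dim="time"):
    """Hypothetical helper: open per-file metadata in parallel via dask."""
    datasets = dask.compute(*[_open_one(p) for p in paths])
    return xr.concat(list(datasets), dim=concat_dim)
```

This is essentially the pattern that later landed in xarray itself as `open_mfdataset(..., parallel=True)`, which wraps the open and preprocess steps in `dask.delayed`.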