slow performance with open_mfdataset
We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.
Each individual netCDF file looks like this:
%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 1)
Coordinates:
* time (time) datetime64[ns] 1993-01-01
* npart (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
z (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
u (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
v (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
y (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...
As shown above, a single data file opens in ~60 ms.
When I call open_mfdataset on 49 files (each with a different time dimension but the same npart), here is what happens:
%time ds = xr.open_mfdataset('*.nc', )
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s
<xarray.Dataset>
Dimensions: (npart: 8192000, time: 49)
Coordinates:
* npart (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
* time (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
z (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
vort (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
u (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
v (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
x (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
y (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...
It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance.
Here is the %prun output for the open_mfdataset command.
748994 function calls (724222 primitive calls) in 142.160 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
49 62.455 1.275 62.458 1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
49 47.207 0.963 47.209 0.963 base.py:1067(is_unique)
196 7.198 0.037 7.267 0.037 {operator.getitem}
49 4.632 0.095 4.687 0.096 netCDF4_.py:182(_open_netcdf4_group)
240 3.189 0.013 3.426 0.014 numeric.py:2476(array_equal)
98 1.937 0.020 1.937 0.020 {numpy.core.multiarray.arange}
4175/3146 1.867 0.000 9.296 0.003 {numpy.core.multiarray.array}
49 1.525 0.031 119.144 2.432 alignment.py:251(reindex_variables)
24 1.065 0.044 1.065 0.044 {method 'cumsum' of 'numpy.ndarray' objects}
12 1.010 0.084 1.010 0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035 0.660 0.000 1.688 0.000 collections.py:50(__init__)
12 0.600 0.050 3.238 0.270 core.py:2761(insert)
12691/7497 0.473 0.000 0.875 0.000 indexing.py:363(shape)
110728 0.425 0.000 0.663 0.000 {isinstance}
12 0.413 0.034 0.413 0.034 {method 'flatten' of 'numpy.ndarray' objects}
12 0.341 0.028 0.341 0.028 {numpy.core.multiarray.where}
2 0.333 0.166 0.333 0.166 {pandas._join.outer_join_indexer_int64}
1 0.331 0.331 142.164 142.164 <string>:1(<module>)
It looks like most of the time is being spent on reindex_variables. I understand why this happens: xarray needs to make sure the dimensions are the same in order to concatenate them together.
Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this reindex_variables step is not necessary, since I know that all the npart dimensions are the same in each file?
In your twitter thread you said
The general reason for this is usually that open_mfdataset performs coordinate compatibility checks when it concatenates the files. It's useful to actually read the code of open_mfdataset to see how it works.
First, all the files are opened individually: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903
You can recreate this step outside of xarray yourself by doing something like
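(a minimal sketch, assuming the same *.nc glob as in the timings above; passing chunks={} makes each variable a single-chunk dask array, which is roughly what open_mfdataset does internally):

```python
import glob
import xarray as xr

# Step 1 of open_mfdataset: open every file individually.
paths = sorted(glob.glob('*.nc'))
datasets = [xr.open_dataset(p, chunks={}) for p in paths]
```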
Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952
You can reproduce this step outside of xarray, e.g.
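(again only a sketch; which combine function actually gets called depends on the combine / concat_dim arguments, but for files that differ only along time it amounts to something like):

```python
# 'datasets' is the list of per-file datasets from the previous step.
# Explicitly concatenate along 'time' ...
combined = xr.combine_nested(datasets, concat_dim='time')
# ... or let xarray infer the order from the coordinate values.
combined = xr.combine_by_coords(datasets)
```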
At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.
Without seeing more details about your files, it’s hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.
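For example (a sketch only; drop_all_coords is just an illustrative helper name, and drop_vars / combine='nested' assume a reasonably recent xarray):

```python
def drop_all_coords(ds):
    # Strip every coordinate variable so that no coordinate
    # compatibility checks (and hence no eager reads) are triggered.
    return ds.drop_vars(list(ds.coords))

ds = xr.open_mfdataset('*.nc', combine='nested', concat_dim='time',
                       preprocess=drop_all_coords)
```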
If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for open_mfdataset, such as coords='minimal', compat='override', etc. (see the sketch below). Once you post your file details, we can provide more concrete suggestions.
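Here is a sketch of what experimenting with those options might look like; data_vars='minimal' and join='override' are additional assumptions on my part, and all of these flags require a fairly recent xarray:

```python
# Tell open_mfdataset to trust the first file's coordinates instead of
# comparing and aligning coordinates across all 49 files.
ds = xr.open_mfdataset(
    '*.nc',
    combine='nested',
    concat_dim='time',
    coords='minimal',
    data_vars='minimal',
    compat='override',
    join='override',
)
```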
OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!