question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

slow performance with open_mfdataset

See original GitHub issue

We have a dataset stored across multiple netCDF files. We are getting very slow performance with open_mfdataset, and I would like to improve this.

Each individual netCDF file looks like this:

%time ds_single = xr.open_dataset('float_trajectories.0000000000.nc')
ds_single
CPU times: user 14.9 ms, sys: 48.4 ms, total: 63.4 ms
Wall time: 60.8 ms

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 1993-01-01
  * npart    (npart) int32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    z        (time, npart) float32 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float32 -9.71733e-10 -9.72858e-10 -9.73001e-10 ...
    u        (time, npart) float32 0.000545563 0.000544884 0.000544204 ...
    v        (time, npart) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float32 180.016 180.047 180.078 180.109 180.141 ...
    y        (time, npart) float32 -79.9844 -79.9844 -79.9844 -79.9844 ...

As shown above, a single data file opens in ~60 ms.

When I call open_mdsdataset on 49 files (each with a different time dimension but the same npart), here is what happens:

%time ds = xr.open_mfdataset('*.nc', )
ds
CPU times: user 1min 31s, sys: 25.4 s, total: 1min 57s
Wall time: 2min 4s

<xarray.Dataset>
Dimensions:  (npart: 8192000, time: 49)
Coordinates:
  * npart    (npart) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * time     (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    z        (time, npart) float64 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 ...
    vort     (time, npart) float64 -9.717e-10 -9.729e-10 -9.73e-10 -9.73e-10 ...
    u        (time, npart) float64 0.0005456 0.0005449 0.0005442 0.0005437 ...
    v        (time, npart) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    x        (time, npart) float64 180.0 180.0 180.1 180.1 180.1 180.2 180.2 ...
    y        (time, npart) float64 -79.98 -79.98 -79.98 -79.98 -79.98 -79.98 ...

It takes over 2 minutes to open the dataset. Specifying concat_dim='time' does not improve performance.

Here is %prun of the open_mfdataset command.

         748994 function calls (724222 primitive calls) in 142.160 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       49   62.455    1.275   62.458    1.275 {method 'get_indexer' of 'pandas.index.IndexEngine' objects}
       49   47.207    0.963   47.209    0.963 base.py:1067(is_unique)
      196    7.198    0.037    7.267    0.037 {operator.getitem}
       49    4.632    0.095    4.687    0.096 netCDF4_.py:182(_open_netcdf4_group)
      240    3.189    0.013    3.426    0.014 numeric.py:2476(array_equal)
       98    1.937    0.020    1.937    0.020 {numpy.core.multiarray.arange}
4175/3146    1.867    0.000    9.296    0.003 {numpy.core.multiarray.array}
       49    1.525    0.031  119.144    2.432 alignment.py:251(reindex_variables)
       24    1.065    0.044    1.065    0.044 {method 'cumsum' of 'numpy.ndarray' objects}
       12    1.010    0.084    1.010    0.084 {method 'sort' of 'numpy.ndarray' objects}
5227/4035    0.660    0.000    1.688    0.000 collections.py:50(__init__)
       12    0.600    0.050    3.238    0.270 core.py:2761(insert)
12691/7497    0.473    0.000    0.875    0.000 indexing.py:363(shape)
   110728    0.425    0.000    0.663    0.000 {isinstance}
       12    0.413    0.034    0.413    0.034 {method 'flatten' of 'numpy.ndarray' objects}
       12    0.341    0.028    0.341    0.028 {numpy.core.multiarray.where}
        2    0.333    0.166    0.333    0.166 {pandas._join.outer_join_indexer_int64}
        1    0.331    0.331  142.164  142.164 <string>:1(<module>)

It looks like most of the time is being spent on reindex_variables. I understand why this happens…xarray needs to make sure the dimensions are the same in order to concatenate them together.

Is there any obvious way I could improve the load time? For example, can I give a hint to xarray that this reindex_variables step is not necessary, since I know that all the npart dimensions are the same in each file?

Possibly related to #1301 and #1340.

Issue Analytics

  • State:open
  • Created 6 years ago
  • Comments:39 (21 by maintainers)

github_iconTop GitHub Comments

5reactions
rabernatcommented, Dec 5, 2019

In your twitter thread you said

Do any of my xarray/dask folks know why open_mfdataset takes such a significant amount of time compared to looping over a list of files? Each file corresponds to a new time, just wanting to open multiple times at once…

The general reason for this is usually that open_mfdataset performs coordinate compatibility checks when it concatenates the files. It’s useful to actually read the code of open_mfdataset to see how it works.

First, all the files are opened individually https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L900-L903

You can recreate this step outside of xarray yourself by doing something like

from glob import glob
datasets = [xr.open_dataset(fname, chunks={}) for fname in glob('*.nc')]

Once each dataset is open, xarray calls out to one of its combine functions. This logic has gotten more complex over the years as different options have been introduced, but the gist is this: https://github.com/pydata/xarray/blob/577d3a75ea8bb25b99f9d31af8da14210cddff78/xarray/backends/api.py#L947-L952

You can reproduce this step outside of xarray, e.g.

ds = xr.concat(datasets, dim='time')

At that point, various checks will kick in to be sure that the coordinates in the different datasets are compatible. Performing these checks requires the data to be read eagerly, which can be a source of slow performance.

Without seeing more details about your files, it’s hard to know exactly where the issue lies. A good place to start is to simply drop all coordinates from your data as a preprocessing step.

def drop_all_coords(ds):
    return ds.reset_coords(drop=True)

xr.open_mfdataset('*.nc', combine='by_coords', preprocess=drop_all_coords)

If you observe a big speedup, this points at coordinate compatibility checks as the culprit. From there you can experiment with the various options for open_mfdataset, such as coords='minimal', compat='override', etc.

Once you post your file details, we can provide more concrete suggestions.

3reactions
shoyercommented, Mar 9, 2018

OK, so it seems that we need a change to disable wrapping dask arrays with LazilyIndexedArray. Dask arrays are already lazy!

Read more comments on GitHub >

github_iconTop Results From Across the Web

xarray load() running slowly on open_mfdataset() data
xarray.open_mfdataset will make a python list of xarray.Datasets and will concatenate them after all files parsed to the list.
Read more >
performance issues with open_mfdataset - Google Groups
I have a separate issue open for the slow performance of open_mfdataset due to this indexing and alignment step: https://github.com/pydata/xarray/issues/ ...
Read more >
Very slow selection of multiple points in large dataset using ...
Hello, I am trying to sample hourly model output to match the time and location of observations collected during a flight. There are...
Read more >
Optimizing aggregation and extraction of NetCDF files using ...
I already have a working code but it is very slow and my hunch is that ... I need to generate new variables...
Read more >
Out of core computation with dask — xarray 0.9.5 documentation
By default, open_mfdataset will chunk each netCDF file into a single dask array; again, ... dask performance can become very slow in certain...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found