
round-trip performance with save_mfdataset / open_mfdataset


I have encountered some major performance bottlenecks in trying to write and then read multi-file netCDF datasets.

I start with an xarray dataset created by xgcm with the following repr:

<xarray.Dataset>
Dimensions:              (XC: 400, XG: 400, YC: 400, YG: 400, Z: 40, Zl: 40, Zp1: 41, Zu: 40, layer_1TH_bounds: 43, layer_1TH_center: 42, layer_1TH_interface: 41, time: 1566)
Coordinates:
    iter                 (time) int64 8294400 8294976 8295552 8296128 ...
  * time                 (time) int64 8294400 8294976 8295552 8296128 ...
  * XC                   (XC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ...
  * YG                   (YG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ...
  * XG                   (XG) >f4 0.0 5000.0 10000.0 15000.0 20000.0 25000.0 ...
  * YC                   (YC) >f4 2500.0 7500.0 12500.0 17500.0 22500.0 ...
  * Zu                   (Zu) >f4 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 -91.0 ...
  * Zl                   (Zl) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ...
  * Zp1                  (Zp1) >f4 0.0 -10.0 -20.0 -30.0 -42.0 -56.0 -72.0 ...
  * Z                    (Z) >f4 -5.0 -15.0 -25.0 -36.0 -49.0 -64.0 -81.5 ...
    rAz                  (YG, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    dyC                  (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    rAw                  (YC, XG) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    dxC                  (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    dxG                  (YG, XC) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    dyG                  (YC, XG) >f4 5000.0 5000.0 5000.0 5000.0 5000.0 ...
    rAs                  (YG, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    Depth                (YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    rA                   (YC, XC) >f4 2.5e+07 2.5e+07 2.5e+07 2.5e+07 ...
    PHrefF               (Zp1) >f4 0.0 98.1 196.2 294.3 412.02 549.36 706.32 ...
    PHrefC               (Z) >f4 49.05 147.15 245.25 353.16 480.69 627.84 ...
    drC                  (Zp1) >f4 5.0 10.0 10.0 11.0 13.0 15.0 17.5 20.5 ...
    drF                  (Z) >f4 10.0 10.0 10.0 12.0 14.0 16.0 19.0 22.0 ...
    hFacC                (Z, YC, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    hFacW                (Z, YC, XG) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    hFacS                (Z, YG, XC) >f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
  * layer_1TH_bounds     (layer_1TH_bounds) >f4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0 ...
  * layer_1TH_interface  (layer_1TH_interface) >f4 0.0 0.2 0.4 0.6 0.8 1.0 ...
  * layer_1TH_center     (layer_1TH_center) float32 -0.1 0.1 0.3 0.5 0.7 0.9 ...
Data variables:
    T                    (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    U                    (time, Z, YC, XG) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    V                    (time, Z, YG, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    S                    (time, Z, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 ...
    Eta                  (time, YC, XC) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    W                    (time, Zl, YC, XC) float32 -0.0 -0.0 -0.0 -0.0 -0.0 ...

An important point to note is that there are lots of “non-dimension coordinates” corresponding to various parameters of the numerical grid.
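
For context, a quick way to list those non-dimension coordinates programmatically (a minimal sketch; `ds` is the dataset above):

# Dimension coordinates are marked with '*' in the repr; everything else
# in ds.coords is a non-dimension coordinate.
non_dim_coords = set(ds.coords) - set(ds.dims)
print(sorted(non_dim_coords))  # e.g. ['Depth', 'PHrefC', ..., 'rAz']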

I save this dataset to a multi-file netCDF dataset as follows:

import xarray as xr

# `ds` and `outdir` come from the context above
iternums, datasets = zip(*ds.groupby('time'))
paths = [outdir + 'xmitgcm_data.%010d.nc' % it for it in iternums]
xr.save_mfdataset(datasets, paths)

This takes many hours to run, since it has to read and write all the data. (I think there are some performance issues here too, related to how dask schedules the read / write tasks, but that is probably a separate issue.)
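(As an aside, later xarray releases added a compute=False option to save_mfdataset, which builds all the write tasks first and hands the whole graph to dask at once; that sometimes helps the scheduler interleave reads and writes. A sketch, assuming a recent xarray and the datasets / paths from above:)

import dask

# compute=False returns a dask.delayed object instead of writing eagerly;
# computing it executes every read/write task in a single dask graph.
delayed = xr.save_mfdataset(datasets, paths, compute=False)
dask.compute(delayed)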

Then I try to re-load the dataset:

ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc')

This raises an error:

ValueError: too many different dimensions to concatenate: {'YG', 'Z', 'Zl', 'Zp1', 'layer_1TH_interface', 'YC', 'XC', 'layer_1TH_center', 'Zu', 'layer_1TH_bounds', 'XG'}

I need to specify concat_dim='time' in order to concatenate the data properly. It seems like this should be unnecessary, since I am reading back data that was just written with xarray, but I understand why it happens: the dimensions of the Data Variables in each file are just Z, YC, XC, with no time dimension. Once I do that, it works, but it takes 18 minutes to load the dataset. I assume this is because it has to check the compatibility of all the non-dimension coordinates.
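
A workaround that is sometimes suggested for this situation is to drop the repeated grid coordinates during preprocessing, so there is far less to compare, and merge them back from a single file afterwards. A hedged sketch (drop_vars and combine='nested' assume a reasonably recent xarray; at the time of this issue the spelling was ds.drop and plain concat_dim='time'):

import glob
import xarray as xr

paths = sorted(glob.glob('xmitgcm_data.*.nc'))

# The grid coordinates (rAz, dyC, hFacC, ...) are identical in every file,
# so checking them across all 1566 files is wasted work.  Keep 'iter',
# which genuinely varies with time.
grid = xr.open_dataset(paths[0])
non_dim_coords = set(grid.coords) - set(grid.dims) - {'iter'}

def drop_grid_coords(ds):
    return ds.drop_vars(non_dim_coords & set(ds.coords))

ds_nc = xr.open_mfdataset(paths, concat_dim='time', combine='nested',
                          preprocess=drop_grid_coords)
# Restore the grid coordinates from the first file
ds_nc = ds_nc.assign_coords({name: grid[name] for name in non_dim_coords})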

I just thought I would document this, because 18 minutes seems way too long to load a dataset.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

shoyer commented, Mar 30, 2017 (3 reactions)

Indeed, it’s not. We should add some way to pipe these arguments through auto_combine on to concat.
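
(For readers arriving later: this plumbing did land in subsequent xarray releases, so the concat-related checks can be relaxed directly from the open_mfdataset call. A sketch assuming a recent xarray; data_vars/coords='minimal' only concatenates variables that actually carry the concat dimension, and compat='override' skips the equality checks and takes duplicated coordinates from the first file:)

ds_nc = xr.open_mfdataset('xmitgcm_data.*.nc',
                          concat_dim='time', combine='nested',
                          data_vars='minimal', coords='minimal',
                          compat='override')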

karenamckinnon commented, Apr 26, 2017 (1 reaction)

TypeError                                 Traceback (most recent call last)
<ipython-input-152-52564f498ac3> in <module>()
----> 1 ds = xr.open_mfdataset(files[:2], concat_dim='time')

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/backends/api.pyc in open_mfdataset(paths, chunks, concat_dim, preprocess, engine, lock, **kwargs)
    304         datasets = [preprocess(ds) for ds in datasets]
    305 
--> 306     combined = auto_combine(datasets, concat_dim=concat_dim)
    307     combined._file_obj = _MultiFileCloser(file_objs)
    308     return combined

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in auto_combine(datasets, concat_dim)
    376     grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)),
    377                                 datasets).values()
--> 378     concatenated = [_auto_concat(ds, dim=concat_dim) for ds in grouped]
    379     merged = merge(concatenated)
    380     return merged

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in _auto_concat(datasets, dim)
    327                                  'explicitly')
    328             dim, = concat_dims
--> 329         return concat(datasets, dim=dim)
    330 
    331 

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in concat(objs, dim, data_vars, coords, compat, positions, indexers, mode, concat_over)
    115         raise TypeError('can only concatenate xarray Dataset and DataArray '
    116                         'objects, got %s' % type(first_obj))
--> 117     return f(objs, dim, data_vars, coords, compat, positions)
    118 
    119 

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/combine.pyc in _dataset_concat(datasets, dim, data_vars, coords, compat, positions)
    206     dim, coord = _calc_concat_dim_coord(dim)
    207     datasets = [as_dataset(ds) for ds in datasets]
--> 208     datasets = align(*datasets, join='outer', copy=False, exclude=[dim])
    209 
    210     concat_over = _calc_concat_over(datasets, dim, data_vars, coords)

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/alignment.pyc in align(*objects, **kwargs)
     78     all_indexes = defaultdict(list)
     79     for obj in objects:
---> 80         for dim, index in iteritems(obj.indexes):
     81             if dim not in exclude:
     82                 all_indexes[dim].append(index)

/glade/apps/opt/python/2.7.7/gnu-westmere/4.8.2/lib/python2.7/_abcoll.pyc in iteritems(self)
    385         'D.iteritems() -> an iterator over the (key, value) items of D'
    386         for key in self:
--> 387             yield (key, self[key])
    388 
    389     def keys(self):

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/coordinates.pyc in __getitem__(self, key)
    245     def __getitem__(self, key):
    246         if key in self:
--> 247             return self._variables[key].to_index()
    248         else:
    249             raise KeyError(key)

/glade/u/home/mckinnon/.local/lib/python2.7/site-packages/xarray/core/variable.pyc in to_index(self)
   1173             index = index.set_names(valid_level_names)
   1174         else:
-> 1175             index = index.set_names(self.name)
   1176         return index
   1177 

/glade/apps/opt/pandas/0.14.0/gnu/4.8.2/lib/python2.7/site-packages/pandas-0.14.0-py2.7-linux-x86_64.egg/pandas/core/index.pyc in set_names(self, names, inplace)
    380         """
    381         if not com.is_list_like(names):
--> 382             raise TypeError("Must pass list-like as `names`.")
    383         if inplace:
    384             idx = self

TypeError: Must pass list-like as `names`.
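
(The failure above originates in pandas 0.14.0, whose Index.set_names insisted on a list-like argument; later pandas releases accept a scalar label, so upgrading pandas is the usual fix. A minimal illustration, assuming a modern pandas:)

import pandas as pd

idx = pd.Index([1, 2, 3])
idx.set_names('time')    # TypeError on pandas 0.14.0, fine on modern pandas
idx.set_names(['time'])  # list form works on old and new pandas alike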