
problem with open_mfdataset on data with duplicate coordinate variables

See original GitHub issue

I’m having a problem using open_mfdataset to open a set of files that have a “duplicate” dimension. In this case, I’m using a numerical weather model that I have configured on a 300x300 grid. The files are originally in HDF5 format with unnamed dimensions; normally, the model stores the x and y dimensions as phony_dim_0 and phony_dim_1.

However, since the x and y dimensions here are the same length, the model has saved both of them under the single dimension phony_dim_0. This produces variables (THETA, for example) with an ncdump header like:

netcdf out1 {
dimensions:
        phony_dim_1 = 60 ;
        phony_dim_0 = 300 ;
variables:
        float THETA(phony_dim_1, phony_dim_0, phony_dim_0) ;

The output files also contain no time information, so I make sure they’re named sequentially and pass concat_dim='TIME' to force concatenation along a new dimension. Due to the duplicate-dimension problem, however, I get the error below.
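Since there is no time coordinate to sort by, the concatenation order depends entirely on the lexicographic order of the glob. A quick sketch (filenames here are illustrative) shows why zero-padded sequential names matter:

```python
# Lexicographic sorting, which a filename glob relies on, is not numeric:
# "out10" sorts before "out2" unless the numbers are zero-padded.
unpadded = sorted(["out1.nc", "out2.nc", "out10.nc"])
padded = sorted(["out01.nc", "out02.nc", "out10.nc"])
print(unpadded)  # ['out1.nc', 'out10.nc', 'out2.nc'] -- wrong temporal order
print(padded)    # ['out01.nc', 'out02.nc', 'out10.nc'] -- correct order
```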

MCVE Code Sample

I have uploaded sample data and a small program to attempt to open the files here: https://drive.google.com/file/d/1aayITXcwrAP_w9uNqppd9mpaQf3O51s8/view?usp=sharing

import xarray as xr

ds = xr.open_mfdataset("./out*.nc", concat_dim='TIME')
print(ds)

Expected Output

The expected output is to print the contents of the dataset ds to the screen

Problem Description

The program errors out with the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-0079210385b2> in <module>
----> 1 control = xr.open_mfdataset(datadir + "feb2014_control/icefix*g1.h5", concat_dim='TIME')

~/anaconda3/lib/python3.7/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, autoclose, parallel, **kwargs)
    717             data_vars=data_vars, coords=coords,
    718             infer_order_from_coords=infer_order_from_coords,
--> 719             ids=ids)
    720     except ValueError:
    721         for ds in datasets:

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in _auto_combine(datasets, concat_dims, compat, data_vars, coords, infer_order_from_coords, ids)
    551     # Repeatedly concatenate then merge along each dimension
    552     combined = _combine_nd(combined_ids, concat_dims, compat=compat,
--> 553                            data_vars=data_vars, coords=coords)
    554     return combined
    555 

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat)
    473                                                          data_vars=data_vars,
    474                                                          coords=coords,
--> 475                                                          compat=compat)
    476     combined_ds = list(combined_ids.values())[0]
    477     return combined_ds

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in _auto_combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat)
    491         datasets = combined_ids.values()
    492         new_combined_ids[new_id] = _auto_combine_1d(datasets, dim, compat,
--> 493                                                     data_vars, coords)
    494     return new_combined_ids
    495 

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in _auto_combine_1d(datasets, concat_dim, compat, data_vars, coords)
    509         concatenated = [_auto_concat(list(ds_group), dim=dim,
    510                                      data_vars=data_vars, coords=coords)
--> 511                         for id, ds_group in grouped_by_vars]
    512     else:
    513         concatenated = datasets

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in <listcomp>(.0)
    509         concatenated = [_auto_concat(list(ds_group), dim=dim,
    510                                      data_vars=data_vars, coords=coords)
--> 511                         for id, ds_group in grouped_by_vars]
    512     else:
    513         concatenated = datasets

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in _auto_concat(datasets, dim, data_vars, coords)
    367                                  'explicitly')
    368             dim, = concat_dims
--> 369         return concat(datasets, dim=dim, data_vars=data_vars, coords=coords)
    370 
    371 

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in concat(objs, dim, data_vars, coords, compat, positions, indexers, mode, concat_over)
    118         raise TypeError('can only concatenate xarray Dataset and DataArray '
    119                         'objects, got %s' % type(first_obj))
--> 120     return f(objs, dim, data_vars, coords, compat, positions)
    121 
    122 

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in _dataset_concat(datasets, dim, data_vars, coords, compat, positions)
    303         if k in concat_over:
    304             vars = ensure_common_dims([ds.variables[k] for ds in datasets])
--> 305             combined = concat_vars(vars, dim, positions)
    306             insert_result_variable(k, combined)
    307 

~/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in concat(variables, dim, positions, shortcut)
   2083         along the given dimension.
   2084     """
-> 2085     variables = list(variables)
   2086     if all(isinstance(v, IndexVariable) for v in variables):
   2087         return IndexVariable.concat(variables, dim, positions, shortcut)

~/anaconda3/lib/python3.7/site-packages/xarray/core/combine.py in ensure_common_dims(vars)
    296                 common_shape = tuple(non_concat_dims.get(d, dim_len)
    297                                      for d in common_dims)
--> 298                 var = var.set_dims(common_dims, common_shape)
    299             yield var
    300 

~/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in set_dims(self, dims, shape)
   1209         expanded_var = Variable(expanded_dims, expanded_data, self._attrs,
   1210                                 self._encoding, fastpath=True)
-> 1211         return expanded_var.transpose(*dims)
   1212 
   1213     def _stack_once(self, dims, new_dim):

~/anaconda3/lib/python3.7/site-packages/xarray/core/variable.py in transpose(self, *dims)
   1152             return self.copy(deep=False)
   1153 
-> 1154         data = as_indexable(self._data).transpose(axes)
   1155         return type(self)(dims, data, self._attrs, self._encoding,
   1156                           fastpath=True)

~/anaconda3/lib/python3.7/site-packages/xarray/core/indexing.py in transpose(self, order)
   1210 
   1211     def transpose(self, order):
-> 1212         return self.array.transpose(order)
   1213 
   1214 

~/anaconda3/lib/python3.7/site-packages/dask/array/core.py in transpose(self, *axes)
   1633         elif len(axes) == 1 and isinstance(axes[0], Iterable):
   1634             axes = axes[0]
-> 1635         return transpose(self, axes=axes)
   1636 
   1637     @derived_from(np.ndarray)

~/anaconda3/lib/python3.7/site-packages/dask/array/routines.py in transpose(a, axes)
    155     if axes:
    156         if len(axes) != a.ndim:
--> 157             raise ValueError("axes don't match array")
    158     else:
    159         axes = tuple(range(a.ndim))[::-1]

ValueError: axes don't match array
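The final ValueError arises because transpose receives fewer axes than the array has dimensions: the two spatial axes share the name phony_dim_0, so when xarray builds its set of dimension names the duplicates collapse to a single entry, and the resulting axes tuple is too short for the 3-D array. A minimal sketch of the underlying failure in plain NumPy (no xarray involved):

```python
import numpy as np

# shape corresponds to (phony_dim_1, phony_dim_0, phony_dim_0)
arr = np.zeros((60, 300, 300))

# duplicate names collapse to two unique entries
unique_dims = list(dict.fromkeys(("phony_dim_1", "phony_dim_0", "phony_dim_0")))

try:
    # axes=(0, 1) for a 3-D array -> the same ValueError as in the traceback
    arr.transpose(tuple(range(len(unique_dims))))
except ValueError as err:
    print(err)
```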

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.0.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.12.1
pandas: 0.24.2
numpy: 1.16.4
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.1.0
distributed: 2.1.0
matplotlib: 3.1.0
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 41.0.1
pip: 19.1.1
conda: 4.7.5
pytest: None
IPython: 7.6.1
sphinx: 2.1.2

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
shoyer commented, Aug 1, 2019

It would be nice if we handled duplicate dimensions better, but it’s relatively rare so I’m not sure it’s worth making our code significantly more complex.


0 reactions
lsterzinger commented, Jul 31, 2019

Okay I created the preprocess function below:

import numpy as np
import xarray as xr

def fix_dims(ds):
    dims = ds.dims

    ds_fix = xr.Dataset(coords={'z': np.arange(dims['phony_dim_1']),
                                'y': np.arange(dims['phony_dim_0']),
                                'x': np.arange(dims['phony_dim_0'])})
    # `vars` is a list of the variable names to copy, defined elsewhere
    for v in vars:
        dim_len = len(ds[v].dims)

        if dim_len == 3:
            ds_fix[v] = (('z', 'y', 'x'), ds[v])
        elif dim_len == 2:
            ds_fix[v] = (('y', 'x'), ds[v])
        else:
            print("Error with dimensions")

    return ds_fix

And when I run ds = open_mfdataset('<data_files>', concat_dim='time', preprocess=fix_dims), it creates a dataset filled with the variables listed in the vars list.

Is there any interest in having xarray detect this problem in the future and automatically work around it? I’d be interested in possibly trying to figure that out.

Thanks for all the help!
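One way such automatic detection could work is to rename repeated dimension names positionally before rebuilding the variables. The renaming step might look like this sketch (uniquify is a hypothetical helper, not an xarray function):

```python
def uniquify(dims):
    """Append a counter to repeated dimension names so every axis is unique."""
    seen = {}
    out = []
    for d in dims:
        seen[d] = seen.get(d, 0) + 1
        out.append(d if seen[d] == 1 else f"{d}_{seen[d] - 1}")
    return tuple(out)

print(uniquify(("phony_dim_1", "phony_dim_0", "phony_dim_0")))
# ('phony_dim_1', 'phony_dim_0', 'phony_dim_0_1')
```

With unique names in place, each variable could be reconstructed with the new dims, much as fix_dims does by hand above.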
