Dimension mismatch when reading categorical HDF5 column with weird NaN encoding

Hi guys, I got a crash reading an HDF5 file with a message “ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)” (stack trace below).

I think the problem arises whenever reading a categorical column via pytables where NaN is stored as one of the categories (rather than a special code -1). I’m not sure in which situations pytables generates one or the other representation of NaN. (any explanation is appreciated)

In my case, 171285 is the number of rows in my data and 15 is the number of categories including NaN.
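The shape mismatch can be reproduced without pandas or an HDF5 file. The sketch below uses made-up data (8 rows, 3 categories); the arrays `codes` and `mask` only mimic the variables of the same name in pandas/io/pytables.py and are not taken from it.

```python
import numpy as np

# Made-up stand-ins: one code per row, one mask entry per category.
codes = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # 8 rows
mask = np.array([False, True, False])         # 3 categories, the second is NaN

error_message = None
try:
    # Same expression as the failing line, minus the pandas `.values` access.
    codes[codes != -1] -= mask.astype(int).cumsum()
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The left-hand side has one element per row with a valid code, while the right-hand side has one element per category, so NumPy can only combine them when the two lengths happen to be compatible.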

The offending line is:

https://github.com/pandas-dev/pandas/blob/10709762e95be2b98d68b95db93a287ee5af74d4/pandas/io/pytables.py#L2224

I’m not sure what’s going on here or why we need this code (any explanation is appreciated). But indeed, codes has one entry per row and mask has one entry per category (mask[i] is True iff categories[i] is NaN), so subtracting one from the other elementwise seems wrong. It looks like we’d rather want some kind of join, equivalent to

# delta[k] = number of NaN categories at positions <= k
delta = mask.astype(int).cumsum().values
for i in range(len(codes)):
    if codes[i] != -1:
        # Shift each valid code by the NaN categories that precede its category.
        codes[i] -= delta[codes[i]]
        # Right now, we got something equivalent to delta[i], not delta[codes[i]]

I don’t know how to do this fast in raw numpy, though.
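The per-row remapping described above can be vectorized with NumPy fancy indexing. The sketch below is one possible approach, not necessarily the fix that went into pandas; the function name `drop_nan_categories` is hypothetical, and mapping rows that point at a NaN category to -1 is an assumption about the intended behavior.

```python
import numpy as np

def drop_nan_categories(codes, mask):
    """Remap category codes after the masked (NaN) categories are removed.

    codes: integer array, one entry per row, -1 meaning "missing".
    mask:  boolean array, one entry per category, True where it is NaN.
    """
    codes = np.asarray(codes).copy()          # do not modify the caller's array
    mask = np.asarray(mask, dtype=bool)

    # delta[k] = number of NaN categories at positions <= k
    delta = np.cumsum(mask)

    valid = codes != -1
    # Rows whose code points at a NaN category (assumed to become missing).
    points_to_nan = np.zeros(codes.shape, dtype=bool)
    points_to_nan[valid] = mask[codes[valid]]

    # Look delta up by code (delta[codes]), not by row position.
    keep = valid & ~points_to_nan
    codes[keep] -= delta[codes[keep]]
    codes[points_to_nan] = -1
    return codes

# Categories ['bar', NaN, 'foo'] with the NaN category at position 1.
remapped = drop_nan_categories([2, 1, 0, 2, -1], [False, True, False])
print(remapped.tolist())
```

Indexing `delta` with the codes themselves is what makes both sides of the subtraction have one element per row.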

Stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-0fe86b15df04> in <module>()
----> 1 df = pd.read_hdf(hdfs[5])

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_hdf(path_or_buf, key, **kwargs)
    356                                      'contains multiple datasets.')
    357             key = candidate_only_group._v_pathname
--> 358         return store.select(key, auto_close=auto_close, **kwargs)
    359     except:
    360         # if there is an error, close the store

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    720                            chunksize=chunksize, auto_close=auto_close)
    721 
--> 722         return it.get_result()
    723 
    724     def select_as_coordinates(

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in get_result(self, coordinates)
   1426 
   1427         # directly return the result
-> 1428         results = self.func(self.start, self.stop, where)
   1429         self.close()
   1430         return results

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in func(_start, _stop, _where)
    713             return s.read(start=_start, stop=_stop,
    714                           where=_where,
--> 715                           columns=columns, **kwargs)
    716 
    717         # create the iterator

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read(self, where, columns, **kwargs)
   4101     def read(self, where=None, columns=None, **kwargs):
   4102 
-> 4103         if not self.read_axes(where=where, **kwargs):
   4104             return None
   4105 

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_axes(self, where, **kwargs)
   3306         for a in self.axes:
   3307             a.set_info(self.info)
-> 3308             a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
   3309 
   3310         return True

C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in convert(self, values, nan_rep, encoding)
   2112                 if mask.any():
   2113                     categories = categories[~mask]
-> 2114                     codes[codes != -1] -= mask.astype(int).cumsum().values
   2115 
   2116                 self.data = Categorical.from_codes(codes,

ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)

Top GitHub Comments

sschuldenzucker commented, Jul 5, 2018

FYI, as for the question of why pandas creates such an HDF5 file, I think I figured it out:

In my code, I was doing some data conversions that ultimately boiled down to something like:

s = pd.Series(['foo', np.nan, 'bar', 'foo']).astype(str).astype('category')

It turns out that this creates a regular category 'nan' (the string, not NaN; mind the quotes!). This is then written to HDF. 'nan' also happens to be the default nan_rep used by pytables, so when pytables reads the categories in the HDF file back in, it’s converted into NaN (without quotes). So we now have a NaN category, which leads to the offending code being executed.
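One way to avoid creating the 'nan' category in the first place is to convert to category before converting to strings. The following is a sketch, not something proposed in the issue thread: it relies on `Series.cat.rename_categories` accepting a callable, and it assumes the stringified categories remain unique.

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', np.nan, 'bar', 'foo'])

# Categorize first: the missing value gets code -1 and no category of its own.
safe = s.astype('category')
# Then stringify the categories only; missing rows are left untouched.
safe = safe.cat.rename_categories(str)

print(list(safe.cat.categories))
print(safe.cat.codes.tolist())
```

Because the missing value never becomes a category, there is no 'nan' string for nan_rep to collide with on read.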

joseortiz3 commented, Jan 29, 2019

Ran into this problem today, same exact scenario where the string 'nan' is a value in the column due to using df[col_name] = df[col_name].astype(str).astype('category').

>>> df_mixed = pd.DataFrame({'A': np.random.randn(8),
... 'B': np.random.randn(8),
... 'C': np.array(np.random.randn(8), dtype='float32'),
... 'string': 'string',
... 'int': 1,
... 'bool': True,
... 'datetime64': pd.Timestamp('20010102')},
... index=list(range(8)))
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3 -0.246141  0.786121  1.483667  string    1  True 2001-01-02
4  1.760388  1.675248  1.169727  string    1  True 2001-01-02
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed.loc[df_mixed.index[3:5],
... ['A', 'B', 'string', 'datetime64']] = np.nan
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3       NaN       NaN  1.483667     NaN    1  True        NaT
4       NaN       NaN  1.169727     NaN    1  True        NaT
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed['string'].iloc[4]
nan
>>> type(df_mixed['string'].iloc[4])
<class 'float'>
>>> df_mixed['string'] = df_mixed['string'].astype(str).astype('category')
>>> df_mixed
          A         B         C  string  int  bool datetime64
0 -0.833346 -0.598527  1.013500  string    1  True 2001-01-02
1 -0.823901 -0.118210  0.793684  string    1  True 2001-01-02
2  0.725413 -0.867698  1.478408  string    1  True 2001-01-02
3       NaN       NaN  1.483667     nan    1  True        NaT
4       NaN       NaN  1.169727     nan    1  True        NaT
5 -0.000398  0.039454  1.514879  string    1  True 2001-01-02
6 -2.815542 -0.539987 -1.873862  string    1  True 2001-01-02
7  0.791794 -0.031423  1.250562  string    1  True 2001-01-02
>>> df_mixed['string'].iloc[4]
'nan'
>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')

results in the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 394, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 741, in select
    return it.get_result()
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 1483, in get_result
    results = self.func(self.start, self.stop, where)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 734, in func
    columns=columns)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 4180, in read
    if not self.read_axes(where=where, **kwargs):
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 3383, in read_axes
    errors=self.errors)
  File "C:\Users\Joey\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\pytables.py", line 2177, in convert
    codes[codes != -1] -= mask.astype(int).cumsum().values
ValueError: operands could not be broadcast together with shapes (8,) (2,) (8,)

A temporary workaround, until a PR fixes this, is to write with nan_rep=np.nan (not nan_rep='nan'; also, this parameter’s documentation is lacking):

>>> df_mixed.to_hdf(dir+'hdf.hdf',key='df_mixed',format='table',nan_rep=np.nan)
>>> df_mixed2 = pd.read_hdf(dir+'hdf.hdf',key='df_mixed',format='table')
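Before writing, it may help to check which categorical columns contain the literal string that would be used as nan_rep. The helper below is hypothetical (the name `columns_with_nan_rep_category` is not part of pandas) and assumes the default nan_rep of 'nan' mentioned earlier in the thread.

```python
import pandas as pd

def columns_with_nan_rep_category(df, nan_rep='nan'):
    """Hypothetical helper: list categorical columns having `nan_rep` as a category."""
    hits = []
    for col in df.columns:
        is_categorical = isinstance(df[col].dtype, pd.CategoricalDtype)
        if is_categorical and nan_rep in df[col].cat.categories:
            hits.append(col)
    return hits

df = pd.DataFrame({
    'a': pd.Series(['x', 'nan', 'y'], dtype='category'),  # literal 'nan' category
    'b': pd.Series(['x', None, 'y'], dtype='category'),   # real missing value
    'c': [1, 2, 3],                                        # not categorical
})
flagged = columns_with_nan_rep_category(df)
print(flagged)
```

Columns reported by such a check are the ones likely to trigger the error on read, so they are candidates for cleaning up or for writing with a different nan_rep.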
