Dimension mismatch when reading categorical HDF5 column with weird NaN encoding
Hi guys, I got a crash reading an HDF5 file with the message “ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)” (stack trace below).
I think the problem arises whenever reading a categorical column via pytables where NaN is stored as one of the categories (rather than a special code -1). I’m not sure in which situations pytables generates one or the other representation of NaN. (any explanation is appreciated)
In my case, 171285 is the number of rows in my data and 15 is the number of categories including NaN.
The offending line is:
codes[codes != -1] -= mask.astype(int).cumsum().values
I’m not sure what’s going on here or why we need this code (any explanations are appreciated). But indeed, codes has size # rows and mask has size # categories (mask[i] is True iff categories[i] is NaN). So this seems wrong. Looks like we’d rather want some kind of join equivalent to
delta = mask.astype(int).cumsum().values
for i in range(len(codes)):
    if codes[i] != -1:
        codes[i] -= delta[codes[i]]
# Right now, we effectively subtract delta[i], not delta[codes[i]]
I don’t know how to do this fast in raw numpy, though.
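The lookup in the loop above can be written with NumPy fancy indexing, which avoids the Python-level loop. This is only a sketch and not the pandas implementation: the function name remap_codes is hypothetical, and mapping codes that point at a NaN category to -1 is an assumption about the desired behaviour.

```python
import numpy as np


def remap_codes(codes, mask):
    """Shift category codes after dropping the categories flagged in `mask`.

    codes: integer array with one entry per row, -1 meaning "missing".
    mask:  boolean array with one entry per category, True where the
           category is NaN and will be removed.
    (Hypothetical helper for illustration; not part of pandas.)
    """
    codes = np.asarray(codes).copy()
    mask = np.asarray(mask, dtype=bool)

    # delta[c] = number of NaN categories at positions <= c
    delta = mask.cumsum()

    valid = codes != -1

    # Rows whose code points at a NaN category (assumption: these become -1)
    hits_nan = np.zeros(codes.shape, dtype=bool)
    hits_nan[valid] = mask[codes[valid]]

    # Remaining rows are shifted by the number of NaN categories before them
    shift = valid & ~hits_nan
    codes[shift] -= delta[codes[shift]].astype(codes.dtype)
    codes[hits_nan] = -1
    return codes


# Categories ['a', NaN, 'b', 'c'] -> the NaN category sits at position 1
example_mask = np.array([False, True, False, False])
example_codes = np.array([0, 1, 2, 3, -1, 2])
print(remap_codes(example_codes, example_mask))
```

The key difference from the offending line is that delta is indexed by the code values (one lookup per row), so the right-hand side always has the same length as the selection on the left.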
Stack trace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-0fe86b15df04> in <module>()
----> 1 df = pd.read_hdf(hdfs[5])
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_hdf(path_or_buf, key, **kwargs)
356 'contains multiple datasets.')
357 key = candidate_only_group._v_pathname
--> 358 return store.select(key, auto_close=auto_close, **kwargs)
359 except:
360 # if there is an error, close the store
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
720 chunksize=chunksize, auto_close=auto_close)
721
--> 722 return it.get_result()
723
724 def select_as_coordinates(
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in get_result(self, coordinates)
1426
1427 # directly return the result
-> 1428 results = self.func(self.start, self.stop, where)
1429 self.close()
1430 return results
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in func(_start, _stop, _where)
713 return s.read(start=_start, stop=_stop,
714 where=_where,
--> 715 columns=columns, **kwargs)
716
717 # create the iterator
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read(self, where, columns, **kwargs)
4101 def read(self, where=None, columns=None, **kwargs):
4102
-> 4103 if not self.read_axes(where=where, **kwargs):
4104 return None
4105
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in read_axes(self, where, **kwargs)
3306 for a in self.axes:
3307 a.set_info(self.info)
-> 3308 a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
3309
3310 return True
C:\Program Files (x86)\Anaconda3\Lib\site-packages\pandas\io\pytables.py in convert(self, values, nan_rep, encoding)
2112 if mask.any():
2113 categories = categories[~mask]
-> 2114 codes[codes != -1] -= mask.astype(int).cumsum().values
2115
2116 self.data = Categorical.from_codes(codes,
ValueError: operands could not be broadcast together with shapes (171285,) (15,) (171285,)
Top GitHub Comments
FYI, as for the question why pandas creates such a HDF5, I think I figured it out:
In my code, I was doing some data conversions that turned out to be ultimately akin to something like:
It turned out that this creates a regular category 'nan' (not nan, mind the quotes!). This is then written to HDF. 'nan' also happens to be the default nan_rep used by pytables, so when pytables reads the categories in the HDF file back in, it’s converted into nan (without quotes). So we now have a nan category, which leads to the offending code being executed.
Ran into this problem today, same exact scenario where the string 'nan' is a value in the column due to using
df[col_name] = df[col_name].astype(str).astype('category')
Writing this column to HDF and reading it back results in the error.
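To illustrate the distinction these comments draw, here is a minimal sketch comparing a categorical column that holds the string 'nan' with one that holds a real missing value. The series names are made up for the example, and it builds the string column directly rather than via astype(str), since the result of that conversion may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# The string 'nan' is an ordinary value, so it becomes a regular category.
with_string = pd.Series(["a", "nan", "b"], dtype="category")

# A real missing value is not a category; it is encoded with the code -1.
with_missing = pd.Series(["a", np.nan, "b"], dtype="category")

string_categories = list(with_string.cat.categories)
missing_categories = list(with_missing.cat.categories)
missing_codes = with_missing.cat.codes.tolist()

print(string_categories)   # includes the string 'nan'
print(missing_categories)  # only the real values
print(missing_codes)       # the missing row has code -1
```

Checking df[col_name].cat.categories for a literal 'nan' entry before writing is one way to spot the situation described here.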
How to fix it temporarily (until a PR lands): use nan_rep=np.nan (not nan_rep='nan'). Also, this parameter’s documentation is lacking.