read_hdf throws UnicodeDecodeError with Python 3.5 and 3.6 but not with Python 2.7

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.read_hdf('data.h5')

Problem description

The HDF5 dataset was created with pandas' to_hdf in Python 2.7 and can be read back in Python 2.7. When I try to read it with Python 3.5 or Python 3.6, I get the following:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-53006689fd2c> in <module>()
----> 1 df = pd.read_hdf('data.h5')

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
    356                                      'contains multiple datasets.')
    357             key = candidate_only_group._v_pathname
--> 358         return store.select(key, auto_close=auto_close, **kwargs)
    359     except:
    360         # if there is an error, close the store

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    720                            chunksize=chunksize, auto_close=auto_close)
    721 
--> 722         return it.get_result()
    723 
    724     def select_as_coordinates(

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
   1426 
   1427         # directly return the result
-> 1428         results = self.func(self.start, self.stop, where)
   1429         self.close()
   1430         return results

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
    713             return s.read(start=_start, stop=_stop,
    714                           where=_where,
--> 715                           columns=columns, **kwargs)
    716 
    717         # create the iterator

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
   2864             blk_items = self.read_index('block%d_items' % i)
   2865             values = self.read_array('block%d_values' % i,
-> 2866                                      start=_start, stop=_stop)
   2867             blk = make_block(values,
   2868                              placement=items.get_indexer(blk_items))

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/pandas/io/pytables.py in read_array(self, key, start, stop)
   2413         import tables
   2414         node = getattr(self.group, key)
-> 2415         data = node[start:stop]
   2416         attrs = node._v_attrs
   2417 

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in __getitem__(self, key)
    673             start, stop, step = self._process_range(
    674                 key.start, key.stop, key.step)
--> 675             return self.read(start, stop, step)
    676         # Try with a boolean or point selection
    677         elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in read(self, start, stop, step)
    813         atom = self.atom
    814         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 815             outlistarr = [atom.fromarray(arr) for arr in listarr]
    816         else:
    817             # Convert the list to the right flavor

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/vlarray.py in <listcomp>(.0)
    813         atom = self.atom
    814         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 815             outlistarr = [atom.fromarray(arr) for arr in listarr]
    816         else:
    817             # Convert the list to the right flavor

/home/tct/anaconda2/envs/py36/lib/python3.6/site-packages/tables/atom.py in fromarray(self, array)
   1226         if array.size == 0:
   1227             return None
-> 1228         return six.moves.cPickle.loads(array.tostring())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23: ordinal not in range(128)
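
The last frame of the traceback calls cPickle.loads (pickle.loads on Python 3) with no encoding argument. On Python 3, pickle.loads decodes Python 2 byte strings as ASCII by default, which would explain the error when the stored objects contain non-ASCII bytes such as 0xe2. The sketch below reproduces that behaviour using only the standard library. PY2_PICKLE is a hand-written payload assumed to resemble what Python 2 emits for a 3-byte string, and try_load is a hypothetical helper, not part of pandas or PyTables.

```python
import pickle

# Hand-written protocol-2 payload mimicking a Python 2 byte string:
# PROTO 2, SHORT_BINSTRING (length 3), BINPUT 0, STOP.
PY2_PICKLE = b"\x80\x02U\x03\xe2\x80\x99q\x00."


def try_load(data, **kwargs):
    """Unpickle data, returning the exception instead of raising it."""
    try:
        return pickle.loads(data, **kwargs)
    except UnicodeDecodeError as exc:
        return exc


# Default on Python 3 is encoding="ASCII", which cannot decode byte 0xe2.
default_result = try_load(PY2_PICKLE)

# latin1 maps every byte to a code point, so the load succeeds.
latin1_result = try_load(PY2_PICKLE, encoding="latin1")

# "bytes" leaves Python 2 str objects as bytes for later decoding.
bytes_result = try_load(PY2_PICKLE, encoding="bytes")
```

In this sketch, passing encoding="latin1" or encoding="bytes" lets the load succeed. Whether read_hdf offers a way to forward such an option to the underlying unpickling call is not established by this report.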

Expected Output

In [1]: import pandas as pd
In [2]: df = pd.read_hdf('data.h5')

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

5 reactions
zoof commented, Sep 19, 2017

Just a postscript: format='table' only works for a single column of data. When I try to save the entire dataset in Python 2.7, I get:

TypeError: Cannot serialize the column [task_list] because
its data contents are [unicode] object dtype

When saving with encoding='utf-8', the file is saved but again cannot be read in 3.x:

TypeError: lookup() argument must be str, not numpy.bytes_

0 reactions
envhyf commented, Oct 15, 2019

Hi, I ran into a similar issue. The DataFrame was saved in Python 2.7 with format='table', encoding='utf-8'. However, when I read it in Python 3.7 with pd.read_hdf('xxx.hdf', key='xx', encoding='utf-8'), I get the error: lookup() argument must be str, not numpy.bytes_
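
The message "lookup() argument must be str, not numpy.bytes_" has the shape of the TypeError that codecs.lookup raises when it is given a bytes-like encoding name instead of a str. One possible reading, which is an assumption and not confirmed in this thread, is that the encoding name written by Python 2 comes back as bytes on Python 3. The sketch below uses plain bytes as a stand-in for numpy.bytes_, and normalize_encoding is a hypothetical helper, not a pandas function.

```python
import codecs


def normalize_encoding(name):
    """Return the encoding name as str, decoding it first if it arrived as bytes."""
    if isinstance(name, bytes):
        return name.decode("ascii")
    return name


# Stand-in for an encoding attribute that was read back as bytes (assumption).
stored_name = b"utf-8"

# codecs.lookup accepts only str names, so a bytes name raises TypeError.
try:
    codecs.lookup(stored_name)
    lookup_failed = False
except TypeError:
    lookup_failed = True

# After normalizing to str, the same lookup succeeds.
codec_info = codecs.lookup(normalize_encoding(stored_name))
```

This only illustrates where such a TypeError can come from; it does not change how read_hdf itself handles the stored attribute.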
