Cannot specify pickle protocol used when writing HDF5.

Problem description

It appears currently impossible (unless I’m mistaken) to specify what pickle protocol will be used by DataFrame.to_hdf() (and thus by PyTables) if pickling is necessary. That makes it impossible to share HDF5 data written and read by a mix of clients running py37 and py38.

In Python 3.8, pickle protocol 5 was introduced (PEP-574). This prevents my team from supporting py38 in a system meant to support clients running a range of Python versions (from 3.6) and sharing a common distributed filesystem. We plan to eventually deprecate support for py36 and py37, but the problem is that there doesn’t seem to be a way to manage the transition when such a new protocol is introduced.
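
To make the version dependence concrete, here is a minimal sketch using only the standard-library `pickle` module. It assumes an interpreter that supports protocol 4 (Python 3.4 or later); the sample payload is made up for illustration and is not taken from the issue.

```python
import pickle
import sys

payload = ["hello", "world"]  # illustrative data only

# HIGHEST_PROTOCOL depends on the interpreter: 4 on Python 3.4-3.7, 5 on 3.8+.
print(sys.version_info[:2], "HIGHEST_PROTOCOL =", pickle.HIGHEST_PROTOCOL)

# Pinning the protocol explicitly keeps the bytes readable on older interpreters.
portable = pickle.dumps(payload, protocol=4)
# Using HIGHEST_PROTOCOL ties the bytes to the writer's Python version.
newest = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# For protocol >= 2, byte 0 is the PROTO opcode (0x80) and byte 1 is the protocol number.
portable_protocol = portable[1]
newest_protocol = newest[1]
```

When calling `pickle.dumps` directly, the protocol is easy to pin. The difficulty described here is that the call happens inside a library, out of the caller's reach.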

xref: this StackOverflow question.

Example

(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan  8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world'])
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
    return it.get_result()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
    results = self.func(self.start, self.stop, where)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
    "block{idx}_values".format(idx=i), start=_start, stop=_stop
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
    ret = node[0][start:stop]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
    return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>
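
When diagnosing an error like the one above, it can help to check which protocol a pickled payload declares without unpickling it. The following is a small sketch using the standard-library `pickletools` module; the helper name `pickle_protocol` is hypothetical, and extracting the raw pickled bytes from an HDF5 file is not shown.

```python
import pickle
import pickletools


def pickle_protocol(data: bytes):
    """Return the protocol declared by a pickle payload.

    Returns None when the payload has no leading PROTO opcode
    (protocols 0 and 1 do not emit one).
    """
    # genops is lazy, so only the first opcode is parsed here.
    opcode, arg, _pos = next(pickletools.genops(data))
    return arg if opcode.name == "PROTO" else None


print(pickle_protocol(pickle.dumps(["hello", "world"], protocol=4)))
print(pickle_protocol(pickle.dumps(["hello", "world"], protocol=0)))
```

Because nothing is unpickled, the check is safe to run on untrusted data.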

Expected Behavior

df.to_hdf('foo', 'x', pickle_protocol=4)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

4 reactions
pdemarti commented, Mar 28, 2020

Oh, to be clear, I’m not directly writing to pickle, but to HDF5. Yet, in some circumstances, it seems that the operation of writing to HDF5 uses pickling. In those cases, I couldn’t care less about what pickle protocol it uses, except that I want all my clients to be able to read it.

Alas, in those cases where DataFrame.to_hdf() resorts to pickling, it always uses HIGHEST_PROTOCOL. To keep files portable across Python versions (within reason, of course, e.g. 3.6 to 3.8 at the moment), it would be great to be able to tell pandas and PyTables that, even when running py38, any pickling done by to_hdf should use a specific protocol (e.g. 4).
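
One possible workaround, sketched here with the standard library only, is to temporarily lower `pickle.HIGHEST_PROTOCOL` around the write. The helper `capped_pickle_protocol` and the stand-in `library_style_dump` are hypothetical names. The approach only helps if the writing library looks up `pickle.HIGHEST_PROTOCOL` on the module at call time. Whether PyTables does so in a given release is an assumption to verify, not something this thread confirms.

```python
import contextlib
import pickle


@contextlib.contextmanager
def capped_pickle_protocol(max_protocol: int):
    """Temporarily lower pickle.HIGHEST_PROTOCOL.

    Hypothetical helper: the change is process-wide and not thread-safe.
    """
    original = pickle.HIGHEST_PROTOCOL
    pickle.HIGHEST_PROTOCOL = min(original, max_protocol)
    try:
        yield
    finally:
        # Always restore the interpreter's real value.
        pickle.HIGHEST_PROTOCOL = original


def library_style_dump(obj) -> bytes:
    # Stand-in for third-party code that hard-codes HIGHEST_PROTOCOL.
    return pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)


with capped_pickle_protocol(4):
    # In the scenario above, this is where df.to_hdf('foo', 'x') would go.
    capped = library_style_dump(["hello", "world"])

uncapped = library_style_dump(["hello", "world"])
```

Code that passes `protocol=-1`, or that copied the constant with `from pickle import HIGHEST_PROTOCOL` before the cap was applied, is not affected by this patch, so the result should be checked on the oldest Python version that needs to read the file.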

0 reactions
mroeschke commented, Mar 13, 2022

Closing as this is an issue with pytables.
