Cannot specify pickle protocol used when writing HDF5.
Problem description
It appears currently impossible (unless I’m mistaken) to specify what pickle protocol will be used by DataFrame.to_hdf() (and thus by PyTables) if pickling is necessary. That makes it impossible to share HDF5 data written and read by a mix of clients running py37 and py38.
In Python 3.8, pickle protocol 5 was introduced (PEP-574). This prevents my team from supporting py38 in a system meant to support clients running a range of Python versions (from 3.6) and sharing a common distributed filesystem. We plan to eventually deprecate support for py36 and py37, but the problem is that there doesn’t seem to be a way to manage the transition when such a new protocol is introduced.
xref: this StackOverflow question.
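To make the version mismatch concrete, here is a minimal standard-library sketch (no pandas or PyTables involved). It relies on the fact that pickles written with protocol 2 or later start with the PROTO opcode (0x80) followed by one byte holding the protocol number. The helper name pickle_protocol_of is made up for this illustration.

```python
import pickle


def pickle_protocol_of(payload: bytes) -> int:
    """Return the protocol number recorded in a pickle's header.

    Hypothetical helper for illustration. Pickles written with protocol 2
    or later begin with the PROTO opcode (0x80) followed by one byte that
    holds the protocol number.
    """
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    raise ValueError("no PROTO header (protocol 0 or 1, or not a pickle)")


data = ["hello", "world"]

# Pin the protocol explicitly instead of relying on HIGHEST_PROTOCOL.
pinned = pickle.dumps(data, protocol=4)

# What a writer gets when it always asks for the newest protocol.
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle_protocol_of(pinned))  # 4
print(pickle_protocol_of(newest))  # whatever HIGHEST_PROTOCOL is on this interpreter
```

A payload pinned to protocol 4 can be loaded by any interpreter from Python 3.4 onward, whereas one written with protocol 5 is rejected by py37 with the "unsupported pickle protocol: 5" error shown in the example.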
Example
(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan 8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world'])
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
    return it.get_result()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
    results = self.func(self.start, self.stop, where)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
    "block{idx}_values".format(idx=i), start=_start, stop=_stop
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
    ret = node[0][start:stop]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
    return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>
Expected Behavior
df.to_hdf('foo', 'x', pickle_protocol=4)
Top GitHub Comments
Oh, to be clear, I’m not directly writing to pickle, but to HDF5. Yet, in some circumstances, it seems that the operation of writing to HDF5 uses pickling. In those cases, I couldn’t care less about what pickle protocol it uses, except that I want all my clients to be able to read it.
Alas, in those cases where DataFrame.to_hdf() resorts to some pickling, it always does so with HIGHEST_PROTOCOL. Instead, to make the data shareable across Python versions (within reason, of course, e.g. 3.6 to 3.8 at the moment), it would be great to let Pandas and PyTables know that, even if py38 is running, whenever to_hdf ends up pickling, we want it to use a specific protocol (e.g. 4).
Closing as this is an issue with pytables.
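Until a protocol can be passed through explicitly, one possible stopgap is to cap pickle.HIGHEST_PROTOCOL for the duration of the write. The sketch below is only an illustration under a stated assumption: it works when the writing library looks up pickle.HIGHEST_PROTOCOL at call time, which is not confirmed here for any particular PyTables release. The names capped_highest_protocol and library_style_dump are hypothetical, and library_style_dump merely stands in for the library's internal pickling call.

```python
import pickle
from contextlib import contextmanager


@contextmanager
def capped_highest_protocol(protocol: int):
    """Temporarily override pickle.HIGHEST_PROTOCOL (hypothetical helper)."""
    original = pickle.HIGHEST_PROTOCOL
    pickle.HIGHEST_PROTOCOL = protocol
    try:
        yield
    finally:
        # Always restore the real value, even if the write fails.
        pickle.HIGHEST_PROTOCOL = original


def library_style_dump(obj) -> bytes:
    """Stand-in for library code that reads HIGHEST_PROTOCOL at call time.

    Assumption: the real writer behaves like this. Check the installed
    version before relying on it.
    """
    return pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)


with capped_highest_protocol(4):
    payload = library_style_dump(["hello", "world"])
    # With pandas and PyTables installed, the actual write would go here:
    # df.to_hdf('foo', 'x')

print(payload[1])  # 4: the byte after the PROTO opcode is the protocol number
```

If the library copies the constant at import time instead (for example via from pickle import HIGHEST_PROTOCOL), this context manager has no effect, and the override would have to be set before that library is first imported. Either way, it is a process-wide patch and should be treated as a temporary measure.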