
Pickle is significantly slower than a memory copy

See original GitHub issue

My machine copies memory at 5GB/s

In [1]: b = b'0' * 1000000000

In [2]: %time len(b[1:])
CPU times: user 139 ms, sys: 63.3 ms, total: 202 ms
Wall time: 202 ms
Out[2]: 999999999
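(As a side note on the benchmark itself: `b[1:]` is a reasonable proxy for a memory copy because slicing a `bytes` object materializes a new one, whereas slicing a `memoryview` does not copy at all. A minimal sketch contrasting the two, scaled down to 100 MB:)

```python
import time

b = b'0' * 100_000_000  # 100 MB of bytes

t0 = time.perf_counter()
_ = b[1:]  # bytes slicing allocates and copies ~100 MB
copy_t = time.perf_counter() - t0

t0 = time.perf_counter()
_ = memoryview(b)[1:]  # memoryview slicing is a view: no copy
view_t = time.perf_counter() - t0

print(f"copy: {copy_t:.4f}s  view: {view_t:.6f}s")
```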

But NumPy arrays only serialize at 2.5 GB/s

In [4]: import numpy as np

In [5]: x = np.random.randint(0, 255, dtype='u1', size=1000000000)  # 1GB

In [6]: import pickle

In [7]: %time len(pickle.dumps(x, protocol=-1))
CPU times: user 309 ms, sys: 96.2 ms, total: 405 ms
Wall time: 404 ms
Out[7]: 1000000161

Why the extra time?

Versions

Python 3.4, Linux, NumPy 1.11.0

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 3
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

mattip commented on May 7, 2019 (6 reactions)

Support for protocol 5 has been merged. Closing. If the copy benchmark analysis leads to another issue, please open a new one.
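(Pickle protocol 5 addresses exactly the extra-copy problem from the original question: it lets `__reduce_ex__` hand the array's buffer to the caller out-of-band instead of copying it into the pickle stream. A minimal sketch, assuming Python 3.8+ and a NumPy version with protocol-5 support; sizes and variable names are illustrative:)

```python
import pickle
import numpy as np

x = np.random.randint(0, 255, dtype='u1', size=10_000_000)  # ~10 MB

# In-band: the array's bytes are copied into the pickle stream.
in_band = pickle.dumps(x, protocol=5)

# Out-of-band: pickle collects PickleBuffer objects via the callback
# instead of copying the data; the returned "header" is tiny.
buffers = []
header = pickle.dumps(x, protocol=5, buffer_callback=buffers.append)

# Reconstruction consumes the same buffers.
y = pickle.loads(header, buffers=buffers)
assert (y == x).all()
print(len(in_band), len(header), len(buffers))
```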

shoyer commented on Apr 14, 2016 (3 reactions)

The answer is that ndarray.__reduce__ uses tostring() internally (making a copy), and then pickle.dumps makes an additional copy of any data it receives from __reduce__ (writing it into an io.BytesIO, of course).

Compare:

In [22]: %time _ = x.copy()
CPU times: user 286 ms, sys: 324 ms, total: 609 ms
Wall time: 609 ms

In [23]: %time _ = x.__reduce__()
CPU times: user 296 ms, sys: 320 ms, total: 616 ms
Wall time: 615 ms

In [24]: %time _ = pickle.dumps(x, protocol=-1)
CPU times: user 606 ms, sys: 682 ms, total: 1.29 s
Wall time: 1.29 s

It might be possible to do the pickling without an additional copy, but as far as I can tell based on the current design of pickle, that would require converting numpy arrays into bytes or another builtin Python type supported by pickle without a copy (you can’t pickle memory views). Unfortunately, as @teoliphant explains, converting numpy arrays into strings without a copy isn’t possible.

So I guess you could either try to get first class support for memoryview objects into pickle (maybe not a bad idea) or roll your own serialization format.
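(On the "roll your own serialization format" option: NumPy's own `.npy` format already does this, streaming the array's buffer to the file without going through pickle, and `mmap_mode` makes reading it back nearly copy-free. A minimal sketch; the temp-file setup is illustrative:)

```python
import os
import tempfile
import numpy as np

x = np.random.randint(0, 255, dtype='u1', size=10_000_000)  # ~10 MB

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'x.npy')

    # np.save writes a small header plus the raw buffer: no pickle
    # layer, so no extra in-memory copy of the data.
    np.save(path, x)

    # mmap_mode maps the file into memory instead of reading it, so
    # "loading" costs almost nothing until pages are actually touched.
    y = np.load(path, mmap_mode='r')
    assert (y == x).all()
```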
