
shallow copies become deep copies when pickling


Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

>>> import pickle
>>> import numpy
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so xarray auto-assigns an incremental integer coord. I then perform ~3000 transformations and dump the resulting dask-backed array with pickle. For audit purposes, I also have to dump all of the intermediate steps. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~2 MB worth of coord, then creates 3000 views of it which, the moment they are pickled, expand into several GB as they become 3000 independent copies.
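
The blow-up described above can be reproduced with plain numpy, without xarray at all (a minimal sketch; the array size and view count are illustrative, not the original workload):

```python
import pickle

import numpy as np

# Simulate xarray's auto-generated coord plus many shallow copies of it.
base = np.arange(500_000)                  # the shared coord array (~4 MB as int64)
views = [base.view() for _ in range(10)]   # each transformation holds a view

one = len(pickle.dumps(base))
many = len(pickle.dumps((base, *views)))
# Each view serializes as an independent full copy, so the combined pickle
# is roughly (1 + number of views) times the size of the base array.
print(many / one)
```

With 3000 views instead of 10, the same arithmetic yields the multi-gigabyte dumps described above.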

I see a few possible solutions to this:

  1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall in my very specific use case won’t benefit from it.
  2. Do not auto-generate a coord with numpy.arange() if the user doesn’t explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people’s.
  3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; among other issues, they would get converted to numpy several times. Again, it wouldn’t solve the general problem.
  4. Fix the issue upstream in numpy. I haven’t looked into it yet and it’s definitely worth investigating, but I found reports of it as early as 2012, so I suspect there might be some pretty good reason why it works like that…
  5. Whenever xarray performs a shallow copy, take the numpy array instead of creating a view.
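
For context on option (1): a pandas RangeIndex already serializes compactly, because it stores only start/stop/step rather than materializing the values. A hedged illustration of the size difference (pure pandas/numpy, not xarray code):

```python
import pickle

import numpy as np
import pandas as pd

ri = pd.RangeIndex(500_000)   # lazy: stores only start, stop, step
arr = np.arange(500_000)      # materialized: 500k int64 values

print(len(pickle.dumps(ri)))   # a few hundred bytes
print(len(pickle.dumps(arr)))  # roughly 4 MB
```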

I implemented (5) as a workaround in my __getstate__ method. Before:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s

Workaround:

def get_base(array):
    # Return the base array when `array` is a pure view of it
    # (same dtype and same shape); otherwise return `array` unchanged.
    if (
        isinstance(array, numpy.ndarray)
        and array.base is not None
        and array.base.dtype == array.dtype
        and array.base.shape == array.shape
    ):
        return array.base
    return array

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():
            var.data = get_base(var.data)

After:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
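
A quick sanity check of the get_base helper, restated here so the snippet is self-contained: a pure view collapses to its base, while a sliced or reinterpreted view (whose data or layout differs from the base) is left alone:

```python
import numpy as np

def get_base(array):
    # Return the base only when the view covers it exactly.
    if (
        isinstance(array, np.ndarray)
        and array.base is not None
        and array.base.dtype == array.dtype
        and array.base.shape == array.shape
    ):
        return array.base
    return array

a = np.arange(10)
assert get_base(a.view()) is a   # pure view collapses to its base
s = a[2:5]
assert get_base(s) is s          # partial slice is kept: shape differs
f = a.view(np.float64)
assert get_base(f) is f          # reinterpreted dtype is kept
assert get_base(a) is a          # non-view passes through unchanged
print("ok")
```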

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

shoyer commented, Oct 25, 2016 (2 reactions)

I answered the StackOverflow question: https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling/40247761#40247761

This was a tricky puzzle to figure out!

shoyer commented, Feb 5, 2017 (0 reactions)

Alternatively, it could make sense to change pickle upstream in NumPy to special case arrays with a stride of 0 along some dimension differently.
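
As a hedged illustration of why stride-0 arrays are the worst case for this: a broadcast view carries almost no real data, yet pickle materializes it in full:

```python
import pickle

import numpy as np

small = np.zeros(1)                     # 8 bytes of real data
big = np.broadcast_to(small, (2**20,))  # view with stride 0, no extra memory
print(big.strides)                      # (0,)
print(len(pickle.dumps(big)) / 2**20)   # ~8 MiB: pickle expands the view
```

Special-casing stride 0 in ndarray pickling, as suggested, would let such views round-trip in a few bytes instead.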
