shallow copies become deep copies when pickling
Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays. This design fails when the object is pickled.
Whenever a numpy view is pickled, it becomes a regular array:
>>> import pickle
>>> import numpy
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
This has devastating effects in my use case. I start from a dask-backed DataArray with one dimension of 500,000 elements and no coord, so xarray auto-assigns an incremental integer coord. I then perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I also have to dump all intermediate steps for audit purposes. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2 MB worth of coord, then creates 3000 views of it, which expand into 3000 independent copies, several GB in total, the moment they are pickled.
I see a few possible solutions to this:
1. Implement pandas range indexes in xarray. This would be nice as a general feature and would solve my specific problem, but anybody who does not fall in my very specific use case won't benefit from it.
2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave it as None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, among other issues. Again, it wouldn't solve the general problem.
4. Fix the issue upstream in numpy. I haven't looked into it yet and it's definitely worth investigating, but I found reports of it as early as 2012, so I suspect there might be some pretty good reason why it works like that…
5. Whenever xarray performs a shallow copy, take the base of the numpy array instead of creating a view.
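To illustrate why option 1 is attractive: a range index serializes in constant space regardless of length, unlike a materialized coordinate. A minimal sketch, using pandas.RangeIndex as a stand-in for what xarray could adopt:

```python
import pickle

import numpy as np
import pandas as pd

# A materialized coordinate pickles in O(n) bytes (~4 MB here for int64)...
dense = np.arange(500_000)
# ...while a RangeIndex stores only start/stop/step, so it pickles in O(1).
lazy = pd.RangeIndex(500_000)

print(len(pickle.dumps(dense)))
print(len(pickle.dumps(lazy)))
```

With 3000 intermediate dumps, that difference compounds to the multi-GB blowup described above.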
I implemented (5) as a workaround in my __getstate__ method. Before:
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s
Workaround:
def get_base(array):
    # Return array.base only when it is a full-size view with the same
    # dtype and shape; otherwise keep the array as-is.
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():
            var.data = get_base(var.data)
After:
%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
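The mechanism behind the saving can be reproduced on a toy pair of arrays: replacing a view with its base before pickling lets pickle's memo deduplicate the shared object instead of serializing the buffer twice. A minimal sketch:

```python
import pickle

import numpy as np

a = np.arange(2**20)
b = a.view()

# Pickling (array, view) serializes the underlying buffer twice...
naive = len(pickle.dumps((a, b)))
# ...but pickling (array, base-of-view) hits pickle's identity memo,
# so the buffer is serialized once and referenced the second time.
shared = len(pickle.dumps((a, b.base)))

print(naive / shared)  # close to 2
```

This is exactly what get_base above does for every variable and coord in the cache before dumping.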
Issue analytics: created 7 years ago · 1 reaction · 10 comments (8 by maintainers)
Top GitHub Comments
I answered the StackOverflow question: https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling/40247761#40247761
This was a tricky puzzle to figure out!
Alternatively, it could make sense to change pickling upstream in NumPy to special-case arrays with a stride of 0 along some dimension.
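The stride-0 case mentioned above can be reproduced with numpy.broadcast_to, which builds a view whose broadcast axis has stride 0; pickling materializes it. A minimal sketch:

```python
import pickle

import numpy as np

row = np.arange(1000)
# broadcast_to returns a read-only view with stride 0 along axis 0:
big = np.broadcast_to(row, (1000, 1000))
assert big.strides[0] == 0

# pickle materializes the broadcast, so the dump scales with the full
# shape (1000 * 1000 * 8 bytes) rather than the underlying 8 kB buffer.
print(len(pickle.dumps(big)))
```

A special case in NumPy's pickling could instead record the base buffer plus the strides, restoring the view on load.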