savez fails on large array of objects
I'm getting a `RuntimeError: File size unexpectedly exceeded ZIP64 limit` when using `savez` to save a few arrays to a file, the largest of which is an array of matrices, each with a different number of rows (so the top-level array's dtype is `object`). As best I can tell, the condition numpy uses internally to decide whether to pass `force_zip64=True` to the zipfile `open` checks whether any of the arrays to be put into the archive has `nbytes` greater than 2^30. My array of matrices reports `nbytes` less than 2^30, but its total size in bytes actually exceeds this (it's about 1.7 GB). I'm using Python 3.6 and numpy 1.14.2 on Ubuntu 16.04.
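To make the undercounting concrete (a minimal illustration, not from the original report): `nbytes` on an object array counts only the per-element pointers, not the arrays they reference.

```python
import numpy as np

# Build a small object array of ragged matrices, as in the report.
mats = [np.zeros((100, 4)), np.zeros((200, 4)), np.zeros((300, 4))]
obj = np.empty(len(mats), dtype=object)
for i, m in enumerate(mats):
    obj[i] = m

print(obj.nbytes)                  # 24 on a 64-bit build: just three pointers
print(sum(m.nbytes for m in obj))  # 19200: the actual payload
```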
The following code produces the error:

```python
import numpy as np

test_data = np.asarray([np.random.rand(np.random.randint(50, 100), 4)
                        for i in range(800000)])
np.savez('test', test_data=test_data)
```

whereas changing `np.random.randint(50, 100)` to `75` produces no error.
Hi. I have reported #13153 and noticed that this issue is due to the same root cause.
The problem in current numpy is that `_savez` doesn't correctly determine the actual size of the data when it is a dict or a list (of np arrays), so I think modifying this part so that it properly identifies the correct size would solve the issue, while keeping the current conditional `force_zip64` behavior. Let me propose a brief sketch of that direction below, as a modification of lines 727-728: https://github.com/numpy/numpy/blob/d89c4c7e7850578e5ee61e3e09abd86318906975/numpy/lib/npyio.py#L727-L728
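Roughly along these lines (a minimal sketch of the direction only; `deep_nbytes` is an illustrative name, and `zipf`, `fname`, and `val` are the locals of `_savez`'s write loop):

```python
import numpy as np

def deep_nbytes(val):
    # Recursively sum payload bytes for dicts, lists, and object
    # arrays, which a plain .nbytes check undercounts (it only
    # counts the element pointers of an object array).
    if isinstance(val, dict):
        return sum(deep_nbytes(v) for v in val.values())
    if isinstance(val, (list, tuple)):
        return sum(deep_nbytes(v) for v in val)
    arr = np.asanyarray(val)
    if arr.dtype == object:
        return sum(deep_nbytes(v) for v in arr.ravel())
    return arr.nbytes

# ...and in _savez, keep force_zip64 conditional but base it on the
# recursive size rather than on val.nbytes:
force_zip64 = deep_nbytes(val) >= 2**30
fid = zipf.open(fname, 'w', force_zip64=force_zip64)
```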
Again, I know that this change is still totally incomplete and dirty, but since it is quite a different approach from what was proposed above, I just wanted to show a direction first.
This would solve both the list and dict cases (note that, for simplicity, I didn't convert the list to np.array in the first line here), while setting `force_zip64` only when necessary.

I agree, that would be the simplest and the best. I just felt it's a bit too long to wait (almost) indefinitely until the numpy + Python 2 EOL for this kind of bug, but as you mention it'll be ready in 1.17, so that's OK for me too (of course I want this fix right now, though :) ).