Converting sparse matrices directly to persistent zarr arrays
I’m trying to convert a huge scipy csc_matrix (27998 x 1306127, int32) to a persistent zarr array. Here is the code I’m using:
import tables
import scipy.sparse as sp_sparse
import zarr
f = tables.open_file('1M_neurons_filtered_gene_bc_matrices_h5.h5')
matrix = sp_sparse.csc_matrix(
    (f.root.mm10.data[:],
     f.root.mm10.indices[:],
     f.root.mm10.indptr[:]),
    shape=f.root.mm10.shape[:])
matrix = matrix.todense()
zarr.array(matrix, store='1m.zarr', overwrite=True)
However, this fails because scipy cannot convert sparse matrices with more than 2**31 non-zero elements to dense (https://github.com/scipy/scipy/issues/7230).
So I was wondering whether it would be possible to convert a sparse matrix to zarr without first converting it to a dense array.
If I remove the dense conversion line (matrix = matrix.todense()), it fails with the following exception:
Traceback (most recent call last):
File "convert.py", line 23, in <module>
zarr.array(matrix, store='1m.zarr', overwrite=True)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/creation.py", line 311, in array
z[:] = data
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 619, in __setitem__
value[value_selection])
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 693, in _chunk_setitem
self._chunk_setitem_nosync(cidx, item, value)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 735, in _chunk_setitem_nosync
chunk = np.ascontiguousarray(value, dtype=self._dtype)
File "/home/g/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 620, in ascontiguousarray
return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Closing remaining open files:1M_neurons_filtered_gene_bc_matrices_h5.h5...done
Here is the URL to the HDF file for full reproducibility: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
This issue can also be reproduced easily via the following code:
import numpy as np
import scipy.sparse as sp
import zarr
x = sp.csc_matrix((3, 4), dtype=np.int8)
y = zarr.array(x)
Issue Analytics
- State: closed
- Created 6 years ago
- Comments: 8 (8 by maintainers)
cc @ryan-williams (in case this and/or the xrefs are of interest)
Going to close as Zarr 2.2.0rc3 supports the strategy outlined in the comment above. Guessing this solves your issue @gokceneraslan, so will close this out. However, if that is not the case, please feel free to let us know and we can reopen and discuss further. Thanks for the report.