Converting sparse matrices directly to persistent zarr arrays
I’m trying to convert a huge scipy csc_matrix (27998 x 1306127, int32) to a persistent zarr array. Here is the code I’m using:
import tables
import scipy.sparse as sp_sparse
import zarr
f = tables.open_file('1M_neurons_filtered_gene_bc_matrices_h5.h5')
matrix = sp_sparse.csc_matrix(
    (f.root.mm10.data[:],
     f.root.mm10.indices[:],
     f.root.mm10.indptr[:]),
    shape=f.root.mm10.shape[:])
matrix = matrix.todense()
zarr.array(matrix, store='1m.zarr', overwrite=True)
However, this fails because scipy cannot convert sparse matrices with more than 2**31 non-zero elements to dense (https://github.com/scipy/scipy/issues/7230).
So I was wondering whether it would be possible to convert a sparse matrix to zarr without first converting it to a dense array.
If I remove the dense conversion line (matrix = matrix.todense()), it fails with the following exception:
Traceback (most recent call last):
File "convert.py", line 23, in <module>
zarr.array(matrix, store='1m.zarr', overwrite=True)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/creation.py", line 311, in array
z[:] = data
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 619, in __setitem__
value[value_selection])
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 693, in _chunk_setitem
self._chunk_setitem_nosync(cidx, item, value)
File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 735, in _chunk_setitem_nosync
chunk = np.ascontiguousarray(value, dtype=self._dtype)
File "/home/g/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 620, in ascontiguousarray
return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Closing remaining open files:1M_neurons_filtered_gene_bc_matrices_h5.h5...done
Here is the URL to the HDF file for full reproducibility: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
This issue can also be reproduced easily via the following code:
import numpy as np
import scipy.sparse as sp
import zarr
x = sp.csc_matrix((3, 4), dtype=np.int8)
y = zarr.array(x)
Issue Analytics
- State: closed
- Created 6 years ago
- Comments: 8 (8 by maintainers)
cc @ryan-williams (in case this and/or the xrefs are of interest)
Going to close as Zarr 2.2.0rc3 supports the strategy outlined in the comment above. Guessing this solves your issue @gokceneraslan, so will close this out. However, if that is not the case, please feel free to let us know and we can reopen and discuss further. Thanks for the report.