Converting sparse matrices directly to persistent zarr arrays

I’m trying to convert a huge scipy csc_matrix (27998 x 1306127, int32) to a persistent zarr array. Here is the code I’m using:

import tables
import scipy.sparse as sp_sparse
import zarr

f = tables.open_file('1M_neurons_filtered_gene_bc_matrices_h5.h5')
matrix = sp_sparse.csc_matrix((f.root.mm10.data[:],
                               f.root.mm10.indices[:],
                               f.root.mm10.indptr[:]),
                              shape=f.root.mm10.shape[:])
matrix = matrix.todense()
zarr.array(matrix, store='1m.zarr', overwrite=True)

However, this fails because scipy cannot convert sparse matrices with more than 2**31 non-zero elements to dense (https://github.com/scipy/scipy/issues/7230).
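As a rough sanity check on the scale involved, the sketch below counts the cells in a dense array of the shape quoted above and estimates its size at 4 bytes per int32 value. It counts every cell, not just the non-zero ones, so it illustrates the memory cost of a dense copy rather than the exact scipy limit discussed in the linked issue.

```python
# Shape and dtype as quoted in the question above.
n_rows, n_cols = 27998, 1306127
itemsize = 4  # bytes per int32 element

n_elements = n_rows * n_cols          # total cells in the dense array
dense_bytes = n_elements * itemsize   # memory needed for a dense int32 copy

print(f"dense elements: {n_elements:,}")
print(f"exceeds 2**31:  {n_elements > 2**31}")
print(f"dense size:     {dense_bytes / 1e9:.1f} GB")
```

Under these numbers a dense copy would need roughly 146 GB, so even without the scipy limit, densifying the whole matrix at once is unlikely to be practical on most machines.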

So I was wondering whether it would be possible to write a sparse matrix to a zarr array without first converting the whole matrix to a dense array.

If I remove the dense conversion line (matrix = matrix.todense()), it fails with the following exception:

Traceback (most recent call last):
  File "convert.py", line 23, in <module>
    zarr.array(matrix, store='1m.zarr', overwrite=True)
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/creation.py", line 311, in array
    z[:] = data
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 619, in __setitem__
    value[value_selection])
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 693, in _chunk_setitem
    self._chunk_setitem_nosync(cidx, item, value)
  File "/home/g/miniconda3/lib/python3.5/site-packages/zarr/core.py", line 735, in _chunk_setitem_nosync
    chunk = np.ascontiguousarray(value, dtype=self._dtype)
  File "/home/g/miniconda3/lib/python3.5/site-packages/numpy/core/numeric.py", line 620, in ascontiguousarray
    return array(a, dtype, copy=False, order='C', ndmin=1)
ValueError: setting an array element with a sequence.
Closing remaining open files:1M_neurons_filtered_gene_bc_matrices_h5.h5...done

Here is the URL to the HDF file for full reproducibility: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5

This issue can also be reproduced easily via the following code:

import numpy as np
import scipy.sparse as sp
import zarr

x = sp.csc_matrix((3, 4), dtype=np.int8)
y = zarr.array(x)

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
jakirkham commented, Jun 19, 2019

cc @ryan-williams (in case this and/or the xrefs are of interest)

1 reaction
jakirkham commented, Feb 14, 2018

Going to close as Zarr 2.2.0rc3 supports the strategy outlined in the comment above. Guessing this solves your issue @gokceneraslan, so will close this out. However if that is not the case, please feel free to let us know and we can reopen and discuss further. Thanks for the report.
