RFC: Use `np.int64` by default for CSR matrices' `indices` and `indptr`

Context

When creating a scipy.sparse.csr_matrix (for instance from a NumPy array), the underlying indices and indptr arrays use either np.int32 or np.int64 depending on nnz, the number of non-zero elements (namely, np.int64 is used when nnz >= 2**31):

https://github.com/scipy/scipy/blob/935d537feeb5ff2940f50de75bda01dd21350cb9/scipy/sparse/_sputils.py#L134-L185
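To make the rule above concrete, here is a minimal sketch. `choose_index_dtype` is a hypothetical helper that only mirrors the nnz threshold described in this issue; it is not SciPy's implementation, which considers more inputs than nnz alone. The second part simply inspects the index dtype SciPy picks for a small matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1


def choose_index_dtype(nnz):
    """Hypothetical helper (not SciPy's code): int64 only once nnz no longer fits in int32."""
    return np.int64 if nnz > INT32_MAX else np.int32


# A small matrix: the index dtype is chosen by SciPy, not by the caller.
X = csr_matrix(np.eye(3))
print("indices dtype:", X.indices.dtype, "| indptr dtype:", X.indptr.dtype)

# The threshold described in the issue text.
print(choose_index_dtype(2**31 - 1).__name__)
print(choose_index_dtype(2**31).__name__)
```

The point of the sketch is that the resulting dtype is a function of the data (its size), not of anything the caller passed explicitly.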

This mechanism was introduced in https://github.com/scipy/scipy/pull/442 to support np.int64 while, based on discussions among maintainers, preserving the previous implementation, which used np.int32 whenever possible. It was later adapted and propagated to sparse matrix creation via other contributions, notably https://github.com/scipy/scipy/pull/3468 and https://github.com/scipy/scipy/pull/4678/.

Problem

Unfortunately, some algorithms that work directly on CSR arrays require indices and indptr to be statically typed. This is for instance the case for some algorithms implemented in Cython in scikit-learn.

Supporting both np.int32 and np.int64 requires complex code adaptations (and risks performance regressions) that downstream libraries might not want to maintain (this is the case for scikit-learn, for instance).

Moreover, this restricts some algorithms working on sparse matrices in SciPy (namely those of scipy.sparse.csgraph, implemented in Cython) to small problems only.

Ideally, the dtype of indices and indptr would be set when a CSR matrix is created, but downstream libraries cannot control this. Their only option is to cast indices and indptr to the dtype they have chosen, which creates a copy of both arrays.
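As an illustration of the cast mentioned above, the following sketch converts the index arrays of a small CSR matrix to np.int64 and checks that the result does not share memory with the original arrays. The matrix and variable names are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 0.0, 3.0]]))

# Cast the index arrays to the dtype a downstream routine expects.
# astype() copies by default, so this allocates two new arrays.
indices64 = X.indices.astype(np.int64)
indptr64 = X.indptr.astype(np.int64)

copied = (not np.shares_memory(indices64, X.indices)
          and not np.shares_memory(indptr64, X.indptr))
print("new arrays allocated:", copied)
print("indices:", indices64.tolist(), "indptr:", indptr64.tolist())
```

For a matrix with many stored elements, these copies are the overhead the issue would like to avoid.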

Proposed solution

The creation of SciPy CSR matrices could be changed to use int64 by default for indices and indptr, while still allowing int32 to be requested if needed. This would avoid the copy and give a larger representable range, at the cost of a larger memory footprint for CSR matrices with nnz < 2**31.
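The memory cost can be estimated with simple arithmetic: a CSR matrix stores nnz entries in indices and n_rows + 1 entries in indptr. The sketch below compares both index dtypes; `csr_index_nbytes` is a hypothetical helper, and the matrix sizes are arbitrary values chosen for illustration.

```python
import numpy as np


def csr_index_nbytes(nnz, n_rows, index_dtype):
    """Bytes used by indices (nnz entries) plus indptr (n_rows + 1 entries)."""
    itemsize = np.dtype(index_dtype).itemsize
    return (nnz + n_rows + 1) * itemsize


nnz, n_rows = 1_000_000, 10_000  # arbitrary example sizes
bytes32 = csr_index_nbytes(nnz, n_rows, np.int32)
bytes64 = csr_index_nbytes(nnz, n_rows, np.int64)
data_bytes = nnz * np.dtype(np.float64).itemsize  # assuming float64 values

print("index bytes with int32:", bytes32)
print("index bytes with int64:", bytes64)
print("total with int32:", data_bytes + bytes32)
print("total with int64:", data_bytes + bytes64)
```

The index arrays double in size, while the total footprint grows by less than that because the data array is unchanged.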

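Also, unsigned integers (np.uint) could be used instead of signed ones (np.int) because indices and their pointers are non-negative numbers. This would not save memory by itself (an unsigned integer takes as many bytes as a signed integer of the same width), but it would double the representable range for the same footprint. It could, however, break compatibility with already-serialized sparse matrices.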
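For reference, the following sketch prints the byte width and maximum value of the signed and unsigned NumPy integer types relevant here. It only uses np.iinfo and np.dtype and makes no assumption about SciPy.

```python
import numpy as np

for dtype in (np.int32, np.uint32, np.int64, np.uint64):
    info = np.iinfo(dtype)
    width = np.dtype(dtype).itemsize
    print(f"{np.dtype(dtype).name}: {width} bytes, max = {info.max}")

# Same width, roughly twice the positive range for the unsigned variant.
int32_max = np.iinfo(np.int32).max
uint32_max = np.iinfo(np.uint32).max
print(uint32_max == 2 * int32_max + 1)
```

Signed and unsigned types of the same width occupy the same number of bytes; what changes is the largest index that can be represented.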

Past discussions in https://github.com/scipy/scipy/pull/442, https://github.com/scipy/scipy/pull/3468 and https://github.com/scipy/scipy/pull/4678/ partly touched on the first solution, but I think making this choice might break someone else’s usage of sparse matrices.

Before doing any UX and adaptation work, I think it is worth imagining potential scenarios beyond SciPy and scikit-learn where this solution is inadequate, and I would welcome @pv, @wnbell and @rgommers’s experience and expertise on sparse matrices (not required, not urgent).

I am willing to work on a solution.

Top GitHub Comments

2 reactions

perimosocordiae commented, Aug 10, 2022

I’m open to making more broad ranging changes for the new sparse array types, as the need for backward compatibility there is as yet minimal.

For sparse matrices, changing the default index dtype is more difficult. We definitely have users for whom 64-bit indices would be a breaking change due to the increased RAM requirement. Removing the downcasting behavior should be doable, though.

2 reactions

rgommers commented, Aug 9, 2022

I think there are two separate topics: what’s the better design, and what should we actually do given backwards compatibility constraints.

Regarding the design, I agree with @jjerphan that the current behavior is problematic. I think there’s a by now well-established consensus that it’s a bad idea to have output dtype depend on input values rather than input dtypes (for multiple reasons, from code complexity to predictable behavior for users to issues for a JIT compiler like Numba to support a feature).

Regarding backwards compatibility, it’s less clear whether we should only add a new option or whether we should try to change the default behavior. I’m not yet convinced one way or the other. Let’s see what others think. @perimosocordiae do you have an opinion on this?

I think the main work to be done is removing steps which automatically downcast the array.

Oof, that does look bad indeed.

The biggest downside being numpy’s type promotion rules for unsigned integers being unintuitive. Though I thought I heard this might change?

Yes, this should improve, but that’s not a fast process (see NEP 50).
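A small sketch of the promotion behavior referred to above, using only NumPy: mixing signed and unsigned 64-bit integers promotes to a floating-point dtype, because no NumPy integer type can hold both ranges.

```python
import numpy as np

# Dtype-level promotion: no integer type covers both uint64 and int64.
promoted = np.result_type(np.uint64, np.int64)
print("uint64 with int64 ->", promoted)

# The same rule applies to array arithmetic.
mixed = np.array([1], dtype=np.uint64) + np.array([1], dtype=np.int64)
print("array result dtype:", mixed.dtype)

# With 32-bit inputs, a wider signed integer is still available.
print("uint32 with int32 ->", np.result_type(np.uint32, np.int32))
```

Index arithmetic that silently turns into floating point is one reason unsigned index dtypes can be awkward in practice.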
