RFC: Use `np.int64` by default for CSR matrices' `indices` and `indptr`
Context
When creating a `scipy.sparse.csr_matrix` (for instance from a numpy array), the underlying `indices` and `indptr` arrays use either `np.int32` or `np.int64` depending on `nnz`, the number of nonzero elements (namely, `int64` is used when `nnz >= 2**31`):
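A minimal sketch of the current behavior (assuming a recent SciPy; exact dtypes can vary by platform):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small matrix: nnz is far below 2**31, so SciPy picks 32-bit index arrays.
A = csr_matrix(np.eye(3))
print(A.indices.dtype, A.indptr.dtype)  # typically: int32 int32
```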
This mechanism was introduced in https://github.com/scipy/scipy/pull/442 to support `np.int64` while preserving the previous implementation, which used `np.int32` when possible, based on discussions among maintainers. It was later adapted and propagated to sparse matrices' creation via other contributions, notably https://github.com/scipy/scipy/pull/3468 and https://github.com/scipy/scipy/pull/4678.
Problem
Unfortunately, some algorithms that work directly on CSR arrays require fixed types for `indices` and `indptr`. This is for instance the case for some algorithms implemented in Cython in scikit-learn.
Supporting both `np.int32` and `np.int64` comes with complex code adaptation (and potentially performance regressions) that downstream libraries might not want to maintain (this is the case for scikit-learn, for instance).
Moreover, this restricts some algorithms working on sparse matrices in SciPy (namely `scipy.sparse.csgraph`'s, implemented in Cython) to only work on small problems.
Ideally, the dtype of `indices` and `indptr` could be set at CSR matrices' creation, but this is not controllable by downstream libraries. The only available workaround is to cast `indices` and `indptr` to the dtype chosen by the downstream library, which creates a copy of them (see the sketch below).
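For instance, a downstream library needing 64-bit indices today has to do something like the following (`get_int64_index_arrays` is a hypothetical helper, sketched for illustration):

```python
import numpy as np

def get_int64_index_arrays(A):
    """Return int64 versions of a CSR matrix's index arrays.

    astype(..., copy=False) only copies when the dtype differs, so an
    int32-indexed matrix pays for a full duplicate of its index data.
    """
    indices = A.indices.astype(np.int64, copy=False)
    indptr = A.indptr.astype(np.int64, copy=False)
    return indices, indptr
```

Note that wrapping the cast arrays back into a `csr_matrix` may not help, as the constructor can downcast index arrays whose contents fit back to `int32`.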
Proposed solution
The creation of SciPy CSR matrices could be changed to use `int64` by default for `indices` and `indptr`, while still allowing `int32` to be specified if needed. This would avoid the copy and give a wider index range, at the cost of a slightly larger memory footprint for CSR matrices with `nnz < 2**31`.
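A rough illustration of the memory trade-off (a sketch; the figures depend on shape and density):

```python
from scipy.sparse import random as sparse_random

# ~1e6 stored elements: nnz is well below 2**31, so int32 indices are used.
A = sparse_random(10_000, 10_000, density=0.01, format="csr", random_state=0)

int32_bytes = A.indices.nbytes + A.indptr.nbytes
int64_bytes = 2 * int32_bytes  # both index arrays double in width
print(f"index arrays: {int32_bytes / 1e6:.1f} MB (int32) "
      f"vs {int64_bytes / 1e6:.1f} MB (int64)")
```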
Also, `np.uint` types could be used over `np.int` types, because indices and their pointers are non-negative. An unsigned type of a given width represents twice the positive range of its signed counterpart (as the quick check below shows), so `np.uint32` could cover `nnz` up to `2**32` where `np.int64` would otherwise be required, halving the index arrays' memory footprint in that range; however, this would break previously serialized sparse matrices.
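The range difference is easy to verify:

```python
import numpy as np

# Same width, twice the positive range: uint32 can index up to 2**32 - 1
# elements, where a signed type would already need int64.
print(np.iinfo(np.int32).max)   # 2147483647 == 2**31 - 1
print(np.iinfo(np.uint32).max)  # 4294967295 == 2**32 - 1
```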
Past discussions in https://github.com/scipy/scipy/pull/442, https://github.com/scipy/scipy/pull/3468 and https://github.com/scipy/scipy/pull/4678 touched on part of the first solution, but I think making this change might break someone else's usage of sparse matrices.
Before doing any UX and adaptation work, I think it is worth imagining potential scenarios beyond SciPy and scikit-learn where this solution would be inadequate, and I would welcome @pv's, @wnbell's and @rgommers's experience and expertise with sparse matrices (not required, not an urgency).
I am willing to work on a solution.
References
Relevant issues and discussions linked to this issue include (non-exhaustively):
- https://github.com/scipy/scipy/pull/442
- https://github.com/scipy/scipy/pull/3468
- https://github.com/scipy/scipy/pull/4678
Comments
I'm open to making broader-ranging changes for the new sparse array types, as the need for backward compatibility there is as yet minimal.
For sparse matrices, changing the default index dtype is more difficult. We definitely have users for whom 64-bit indices would be a breaking change due to the increased RAM requirement. Removing the downcasting behavior should be doable, though.
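For reference, the downcasting applies even when a user explicitly passes int64 index arrays (a sketch of the behavior as of the SciPy versions discussed here):

```python
import numpy as np
from scipy.sparse import csr_matrix

data = np.array([1.0, 2.0])
indices = np.array([0, 1], dtype=np.int64)
indptr = np.array([0, 1, 2], dtype=np.int64)

# The constructor inspects the contents and, since every value fits in
# 32 bits, silently downcasts both index arrays to int32.
A = csr_matrix((data, indices, indptr), shape=(2, 2))
print(A.indices.dtype, A.indptr.dtype)  # int32 int32
```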
I think there are two separate topics: what's the better design, and what we should actually do given backwards-compatibility constraints.
Regarding the design, I agree with @jjerphan that the current behavior is problematic. I think there's a by-now well-established consensus that it's a bad idea to have an output dtype depend on input values rather than input dtypes (for multiple reasons, from code complexity to predictable behavior for users to the difficulty for a JIT compiler like Numba of supporting such a feature).
Regarding backwards compatibility, it's less clear whether we should only add a new option or whether we should try to change the default behavior. I'm not yet convinced one way or the other. Let's see what others think. @perimosocordiae, do you have an opinion on this?
Oof, that does look bad indeed.
Yes, this should improve, but that’s not a fast process (see NEP 50).