[RFC] Support for int64 indexed SciPy sparse matrices in Cython code
At the moment we do not have systematic support for very large sparse matrices in our Cython code. Such support would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values, that is, beyond what int32 indices can address.
The purpose of this issue is to:
- reference all related issues in scikit-learn;
- decide whether we want uniform support guarantees;
- decide whether we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.
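To make the "uniform support guarantee" question concrete, here is a minimal Python-level sketch of one possible policy for code paths that only handle int32 indices: downcast when it is safe, and fail loudly otherwise. The helper name `ensure_int32_indices` is hypothetical and is not an existing scikit-learn or SciPy function.

```python
import numpy as np
from scipy.sparse import csr_matrix


def ensure_int32_indices(X):
    """Hypothetical helper: return a CSR matrix with int32 index arrays.

    Raises ValueError when the matrix cannot be indexed with int32.
    """
    int32_max = np.iinfo(np.int32).max
    # Column indices are bounded by the shape, and the last entry of
    # indptr equals the number of stored elements.
    if max(X.shape) > int32_max or X.indptr[-1] > int32_max:
        raise ValueError(
            "This sparse matrix requires int64 indices, which this "
            "code path does not support."
        )
    if X.indices.dtype == np.int32 and X.indptr.dtype == np.int32:
        return X
    return csr_matrix(
        (X.data, X.indices.astype(np.int32), X.indptr.astype(np.int32)),
        shape=X.shape,
    )
```

The alternative policy would be to accept both index dtypes in the Cython code itself, which is where centralized type declarations would come in.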
Related issues and PRs (feel free to update this list):
For polynomial feature expansion (quite popular request):
Other models with open issues:
Other Cython estimators that could also be updated:
- neighbors models (k-NN and radius-based models)
  - related issues not just about this problem: #23604
- k-means & variants
- Feature Hasher / Hashing Vectorizer (sklearn/feature_extraction/_hashing_fast.pyx)
Helpful Python snippet
SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed .indices and .indptr attributes:
>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
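For tests on realistically sized data, it can also help to force int64 index arrays onto a small matrix. The sketch below does this by reassigning the index attributes after construction, since the constructor would otherwise pick int32 for small inputs. The helper name `with_int64_indices` is an assumption made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def with_int64_indices(X):
    """Return a CSR copy of X whose .indices and .indptr are int64."""
    X = csr_matrix(X, copy=True)
    # Assign after construction: SciPy chooses the index dtype itself
    # when the matrix is built.
    X.indices = X.indices.astype(np.int64)
    X.indptr = X.indptr.astype(np.int64)
    return X


X_small = csr_matrix(np.eye(3))
X_small_64 = with_int64_indices(X_small)
print(X_small.indices.dtype, X_small_64.indices.dtype)
```

Note that later SciPy operations on such a matrix may return results with int32 indices again, so the dtype should be checked right before the data reaches the Cython code.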
Top GitHub Comments
Typo on my part, corrected now – sorry 😅
For reference, some relevant discussions on the Scientific Python Discourse regarding sparse arrays status and future directions.