[RFC] Support for int64 indexed SciPy sparse matrices in Cython code

At the moment we do not have systematic support for very large sparse matrices in our Cython code. Such support would be useful when the data is passed as a sparse matrix with more than ~2e9 columns or non-zero values, that is, beyond what int32 indices can address.
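To make the ~2e9 threshold concrete, here is a minimal sketch of the check. The helper name `needs_int64_index` is hypothetical, and the rule is only a rough approximation of what SciPy does internally, not its actual implementation:

```python
import numpy as np

# Largest value representable by a signed 32-bit integer (2**31 - 1).
INT32_MAX = np.iinfo(np.int32).max


def needs_int64_index(n_cols, nnz):
    """Hypothetical helper: rough check for whether a CSR matrix with
    `n_cols` columns and `nnz` stored elements would exceed int32 indexing.

    Column indices are bounded by n_cols and row pointers by nnz, so either
    quantity exceeding INT32_MAX is assumed to require int64 index arrays.
    """
    return max(n_cols, nnz) > INT32_MAX


print(INT32_MAX)
print(needs_int64_index(n_cols=1_000, nnz=10))
print(needs_int64_index(n_cols=2_147_483_649, nnz=1))
```

The last call mirrors a single-row matrix whose only stored element sits in column 2147483648, one past the int32 maximum.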

The purpose of this issue is to:

  • reference all related issues in scikit-learn.
  • decide if we want to have some uniform support guarantees or not
  • decide if we need centralized Cython tooling (e.g. type declarations, tempita conventions) to add support for such matrices.

Related issues and PRs (feel free to update this list):

For polynomial feature expansion (quite popular request):

Other models with open issues:

Other Cython estimators that could also be updated:

  • neighbors models (k-NN and radius-based models)
    • related issues not just about this problem: #23604
  • k-means & variants
  • Feature Hasher / Hashing Vectorizer (sklearn/feature_extraction/_hashing_fast.pyx)

Helpful Python snippet

SciPy decides to use the int32 or int64 dtype depending on the dimensions of the matrix and on the number of stored non-zero elements. Here is a quick way to generate a CSR matrix that requires int64-typed `.indices` and `.indptr` attributes:

>>> from scipy.sparse import csr_matrix
>>> import numpy as np
>>>
>>> X = csr_matrix(([1.0], [np.iinfo(np.int32).max + 1], [0, 1]))
>>> X
<1x2147483649 sparse matrix of type '<class 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Row format>
>>> X.indices
array([2147483648])
>>> X.indices.dtype
dtype('int64')
>>> X.indptr.dtype
dtype('int64')
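
For exercising int64 code paths on small inputs, one possible approach is to cast the index arrays of an ordinary matrix after construction. This is a sketch under assumptions: the helper name `with_int64_indices` is hypothetical, and it relies on `.indices` and `.indptr` being assignable attributes of a CSR matrix. Casting after construction matters because the `csr_matrix` constructor may pick int32 again when the values fit.

```python
import numpy as np
from scipy.sparse import csr_matrix


def with_int64_indices(X):
    """Hypothetical helper: return a copy of a CSR matrix whose index
    arrays are int64, regardless of the matrix size."""
    X = X.copy()
    # Assign the cast arrays directly instead of going through the
    # constructor, which could downcast small indices back to int32.
    X.indices = X.indices.astype(np.int64)
    X.indptr = X.indptr.astype(np.int64)
    return X


X_small = csr_matrix(np.eye(3))
X_64 = with_int64_indices(X_small)

print(X_small.indices.dtype, X_small.indptr.dtype)
print(X_64.indices.dtype, X_64.indptr.dtype)
```

The values and shape are unchanged; only the dtype of the index arrays differs, which is what a Cython routine typed for int32 indices would trip over.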

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 5
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
Micky774 commented, Jun 16, 2022

#1680 seems unrelated.

Typo on my part, corrected now – sorry 😅
