
DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices?

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas as pd
from scipy.sparse import csr_matrix
mil = 1000000
big_csr_diag_1s = csr_matrix((mil, mil), dtype="float")
# Following line takes around 15 seconds to run
big_csr_diag_1s.setdiag(1)
# At this point, big_csr_diag_1s is just a completely-sparse matrix with the only
# nonzero values being values of 1 on its diagonal (and there are 1 million of
# these values; I don't think this should be *too* bad to store in a sparse data
# structure).
# The following line runs for at least 5 minutes (I killed it after that point):
pd.DataFrame.sparse.from_spmatrix(big_csr_diag_1s)

Problem description

It seems like the scipy CSR matrix is being converted to a dense array somewhere inside pd.DataFrame.sparse.from_spmatrix(), which makes the call take a very long time (on my laptop, at least).

This seems indicative of an efficiency problem, but if constructing a sparse DataFrame this way really is expected to take this long, I'm happy to close the issue. Thanks!
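As a smaller-scale sketch (my own, not from the issue report): the same kind of matrix can be built directly with scipy.sparse.identity, which avoids the slow setdiag call, and at a size where from_spmatrix still completes quickly you can confirm that every column of the result carries a sparse dtype:

```python
import pandas as pd
from scipy.sparse import identity

# Build a 1000 x 1000 identity matrix directly (much faster than
# allocating an empty CSR matrix and calling setdiag on it).
n = 1_000
mat = identity(n, format="csr", dtype="float")

df = pd.DataFrame.sparse.from_spmatrix(mat)
print(df.shape)           # (1000, 1000)
print(df.sparse.density)  # 0.001 -- one nonzero per column
```

The `.sparse` accessor only exists because every column is sparse; if the conversion densified the data, the memory footprint (and `density`) would give it away.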

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.12.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

4 reactions
jorisvandenbossche commented, Mar 18, 2020

The current from_spmatrix implementation is indeed not very efficient, and I think there is a lot of room for improvement. It will never be super fast (as we need to create many 1D sparse arrays, one per column), but I think we can easily get a 5-10x improvement.

Quick experiment:

def convert_scipy_sparse(X):
    # Convert to CSC so each column is a contiguous slice of data/indices,
    # delimited by indptr.
    X2 = X.tocsc()
    n_rows, n_columns = X2.shape
    data = X2.data
    indices = X2.indices
    indptr = X2.indptr
    dtype = pd.SparseDtype("float64", 0)
    arrays = []
    for i in range(n_columns):
        # Build each column's SparseArray directly from the CSC slices,
        # bypassing the per-column validation that from_spmatrix performs.
        index = pd.core.arrays.sparse.IntIndex(n_rows, indices[indptr[i]:indptr[i + 1]])
        arr = pd.core.arrays.sparse.SparseArray._simple_new(
            data[indptr[i]:indptr[i + 1]], index, dtype
        )
        arrays.append(arr)
    return pd.DataFrame._from_arrays(
        arrays, columns=pd.Index(range(n_columns)), index=pd.Index(range(n_rows))
    )

together with disabling unnecessary validation of the arrays and consolidation in _from_arrays (we should have a fastpath there), this gives me a 5x speedup for a 10k x 10k sparse matrix
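A quick self-contained sanity check (my own sketch, not from the thread) that this column-wise slicing of a CSC matrix's data/indices/indptr arrays reproduces what from_spmatrix builds. Note that IntIndex and SparseArray._simple_new are private pandas APIs, so this may break across pandas versions:

```python
import numpy as np
import pandas as pd
from scipy.sparse import random as sparse_random

# Random sparse test matrix in CSC form: each column is the contiguous
# slice of .data / .indices between consecutive .indptr entries.
X = sparse_random(50, 20, density=0.1, format="csc", random_state=0)

dtype = pd.SparseDtype("float64", 0)
arrays = []
for i in range(X.shape[1]):
    sl = slice(X.indptr[i], X.indptr[i + 1])
    # IntIndex records which rows of the column are nonzero.
    index = pd.core.arrays.sparse.IntIndex(X.shape[0], X.indices[sl])
    arrays.append(pd.arrays.SparseArray._simple_new(X.data[sl], index, dtype))

result = pd.DataFrame(dict(enumerate(arrays)))
expected = pd.DataFrame.sparse.from_spmatrix(X)
print(np.allclose(result.to_numpy(), expected.to_numpy()))
```

This sketch uses the public pd.DataFrame constructor instead of the private _from_arrays fastpath mentioned above, so it only demonstrates correctness of the slicing, not the full speedup.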

2 reactions
rth commented, Mar 18, 2020

I’ll make a PR @jorisvandenbossche .
