DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices?
Code Sample, a copy-pastable example if possible
import pandas as pd
from scipy.sparse import csr_matrix
mil = 1000000
big_csr_diag_1s = csr_matrix((mil, mil), dtype="float")
# Following line takes around 15 seconds to run
big_csr_diag_1s.setdiag(1)
# At this point, big_csr_diag_1s is a sparse matrix whose only nonzero
# values are 1s on its diagonal (1 million values in total; I don't think
# this should be *too* bad to store in a sparse data structure).
# The following line runs for at least 5 minutes (I killed it after that point):
pd.DataFrame.sparse.from_spmatrix(big_csr_diag_1s)
Problem description
It seems like the scipy CSR matrix is being converted to a dense array somewhere inside pd.DataFrame.sparse.from_spmatrix(), which results in that function taking a very long time (on my laptop, at least).
I think this is indicative of an efficiency problem, but if constructing a sparse DataFrame this way really is expected to take this long, then I can close this issue. Thanks!
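A quick way to probe the scaling is to time the conversion at a few sizes (a rough benchmark sketch; the sizes and the use of scipy.sparse.identity here are illustrative choices, not part of the original report):

import time
import pandas as pd
from scipy.sparse import identity

# If the conversion scaled linearly with the number of stored values,
# doubling n should roughly double the runtime.
for n in (1000, 2000, 4000, 8000):
    mat = identity(n, format="csc", dtype="float")
    start = time.perf_counter()
    pd.DataFrame.sparse.from_spmatrix(mat)
    print(n, round(time.perf_counter() - start, 3), "seconds")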
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
Top GitHub Comments
The current from_spmatrix implementation is indeed not very efficient, and I think there is a lot of room for improvement. It will never be super fast (we need to create many 1D sparse arrays, one per column), but I think we can easily get a 5-10x improvement. Quick experiment:
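(The experiment snippet from the original comment did not survive extraction; below is a minimal sketch of that kind of column-wise construction, assuming the pandas-internal IntIndex and SparseArray._simple_new helpers. It is an illustration, not the maintainer's actual code.)

import numpy as np
import pandas as pd
from pandas._libs.sparse import IntIndex

def fast_from_spmatrix(mat):
    # Hypothetical helper: build one 1D SparseArray per column directly
    # from the CSC buffers, avoiding any dense intermediate.
    mat = mat.tocsc()
    mat.sort_indices()  # IntIndex requires strictly increasing positions
    data, indices, indptr = mat.data, mat.indices, mat.indptr
    n_rows, n_cols = mat.shape
    dtype = pd.SparseDtype(data.dtype, np.array(0, dtype=data.dtype).item())
    arrays = {}
    for j in range(n_cols):
        sl = slice(indptr[j], indptr[j + 1])
        # Each column's nonzero entries are one contiguous slice of the
        # CSC buffers; _simple_new skips the validation that the public
        # SparseArray constructor would run for every column.
        arrays[j] = pd.arrays.SparseArray._simple_new(
            data[sl], IntIndex(n_rows, indices[sl]), dtype
        )
    return pd.DataFrame(arrays)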
This, together with disabling unnecessary validation of the arrays and consolidation in _from_arrays (we should have a fastpath there), gives me a 5x speedup for a 10k x 10k sparse matrix.

I'll make a PR @jorisvandenbossche.