DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices?
Code Sample, a copy-pastable example if possible
import pandas as pd
from scipy.sparse import csr_matrix
mil = 1000000
big_csr_diag_1s = csr_matrix((mil, mil), dtype="float")
# Following line takes around 15 seconds to run
big_csr_diag_1s.setdiag(1)
# At this point, big_csr_diag_1s is a sparse matrix whose only nonzero
# values are 1s on its diagonal (1 million values in total; I don't think
# this should be *too* bad to store in a sparse data structure).
# The following line runs for at least 5 minutes (I killed it after that point):
pd.DataFrame.sparse.from_spmatrix(big_csr_diag_1s)
Problem description
It seems like the scipy CSR matrix is being converted to a dense array somewhere inside pd.DataFrame.sparse.from_spmatrix(), which results in that function taking a very long time (on my laptop, at least).
I think this is indicative of an efficiency problem, but if constructing a sparse DataFrame this way really is expected to take this long, then I can close this issue. Thanks!
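A quick way to probe the scaling is to time the conversion at a few sizes (a rough benchmark sketch; the sizes and the use of scipy.sparse.identity here are illustrative choices, not part of the original report):

import time
import pandas as pd
from scipy.sparse import identity

# If the conversion scaled linearly with the number of stored values,
# doubling n should roughly double the runtime.
for n in (1000, 2000, 4000, 8000):
    mat = identity(n, format="csc", dtype="float")
    start = time.perf_counter()
    pd.DataFrame.sparse.from_spmatrix(mat)
    print(n, round(time.perf_counter() - start, 3), "seconds")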
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-76-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
Top GitHub Comments
The current from_spmatrix implementation is indeed not very efficient, and I think there is a lot of room for improvement. It will never be super fast (we need to create many 1D sparse arrays, one per column), but I think we can easily get a 5-10x improvement. Quick experiment:
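(The experiment snippet from the original comment did not survive extraction; below is a minimal sketch of that kind of column-wise construction, assuming the pandas-internal IntIndex and SparseArray._simple_new helpers. It is an illustration, not the maintainer's actual code.)

import numpy as np
import pandas as pd
from pandas._libs.sparse import IntIndex

def fast_from_spmatrix(mat):
    # Hypothetical helper: build one 1D SparseArray per column directly
    # from the CSC buffers, avoiding any dense intermediate.
    mat = mat.tocsc()
    mat.sort_indices()  # IntIndex requires strictly increasing positions
    data, indices, indptr = mat.data, mat.indices, mat.indptr
    n_rows, n_cols = mat.shape
    dtype = pd.SparseDtype(data.dtype, np.array(0, dtype=data.dtype).item())
    arrays = {}
    for j in range(n_cols):
        sl = slice(indptr[j], indptr[j + 1])
        # Each column's nonzero entries are one contiguous slice of the
        # CSC buffers; _simple_new skips the validation that the public
        # SparseArray constructor would run for every column.
        arrays[j] = pd.arrays.SparseArray._simple_new(
            data[sl], IntIndex(n_rows, indices[sl]), dtype
        )
    return pd.DataFrame(arrays)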
This, together with disabling unnecessary validation of the arrays and consolidation in _from_arrays (we should have a fastpath there), gives me a 5x speedup for a 10k x 10k sparse matrix.

I'll make a PR @jorisvandenbossche.