SparseDataFrame loc multiple index is slow
See original GitHub issueIn [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 0, 0],[0,0,0]])
# non sparse
In [19]: %time df.loc[0]
CPU times: user 367 µs, sys: 14 µs, total: 381 µs
Wall time: 374 µs
Out[19]:
0 1
1 0
2 0
Name: 0, dtype: int64
In [20]: %time df.loc[[0]]
CPU times: user 808 µs, sys: 57 µs, total: 865 µs
Wall time: 827 µs
Out[20]:
0 1 2
0 1 0 0
In [3]: df = df.to_sparse()
# sparse
In [5]: %time df.loc[0]
CPU times: user 429 µs, sys: 14 µs, total: 443 µs
Wall time: 436 µs
Out[5]:
0 1.0
1 0.0
2 0.0
Name: 0, dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
In [6]: %time df.loc[[0]]
CPU times: user 4.07 ms, sys: 406 µs, total: 4.48 ms
Wall time: 6.16 ms
Out[6]:
0 1 2
0 1.0 0.0 0.0
It seems in SparseDataFrame, selecting as dataframe, the time increased much more than it does in DataFrame, I would expect it runs faster(on my 1m x 500 SparseDataFrame it takes seconds for the loc operation)
commit: None python: 3.5.1.final.0 python-bits: 64 OS: Darwin OS-release: 15.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_GB.UTF-8 LANG: en_GB.UTF-8
pandas: 0.18.1 nose: None pip: 8.1.2 setuptools: 24.0.2 Cython: None numpy: 1.11.0 scipy: 0.17.1 statsmodels: None xarray: None IPython: 5.0.0 sphinx: None patsy: None dateutil: 2.5.3 pytz: 2016.4 blosc: None bottleneck: None tables: None numexpr: 2.6.1 matplotlib: 1.5.1 openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None httplib2: None apiclient: None sqlalchemy: 1.0.13 pymysql: None psycopg2: 2.6.1 (dt dec pq3 ext lo64) jinja2: None boto: None pandas_datareader: None
Issue Analytics
- State:
- Created 7 years ago
- Comments:10 (6 by maintainers)
I guess. Why would this small difference actually matter in practice? If you are repeatedly using an indexer that is completely non-performant (in either case), and you should be indexing via a larger indexer.
This affects DataFrame[sparse] as well, but I don’t think there’s an obvious fix. We aren’t planning to adopt arbitrary 2D extension arrays, so this will always be slower.
Would certainly welcome performance improvements given the current design though.