Performance of loc and iloc
import pandas as pd
import numpy as np
import time
# print(pd.show_versions())
# create the columns and rows of the df
assets = np.array(range(3400))
dates = np.array(range(5000))
df_bool = pd.DataFrame(False, index=dates, columns=assets)
# create the array from the df
a = df_bool.to_numpy()
valid_since = {asset: np.random.choice(dates, size=1)[0] for asset in assets}
# 1) chained .loc: note this assigns through an intermediate Series
#    (chained assignment), so it may not update df_bool itself
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.loc[start].loc[asset] = True
print(time.time() - h)

# 2) chained .iloc
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.iloc[start].iloc[asset] = True
print(time.time() - h)

# 3) .loc with a single (row, column) label pair
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.loc[start, asset] = True
print(time.time() - h)

# 4) .iloc with a single (row, column) position pair
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.iloc[start, asset] = True
print(time.time() - h)

# 5) direct assignment into the underlying numpy array
h = time.time()
for asset in assets:
    start = valid_since[asset]
    a[start, asset] = True
print(time.time() - h)
Problem description
Hi, happy new year xD. I have a performance problem with loc and iloc when they are used with a comma: in the example code, df.iloc[1000, 1000] takes more time than df.iloc[1000].iloc[1000], and I don't understand why this happens. I also don't understand why access via df.loc[1000, 1000] is so slow; the script spends a lot of time in that part. Note: everything in the code uses direct scalar access. There is no slicing and no selection of multiple assets or dates at the same time; it only needs to update one cell at a time.
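For comparison (my addition, not part of the original report): pandas also provides the scalar accessors .at (label-based) and .iat (position-based), which are designed for exactly this one-cell-at-a-time pattern and skip most of the overhead of .loc/.iloc. A minimal sketch reusing the df_bool, assets and valid_since objects defined above; because the index and columns here are plain integer ranges, labels and positions happen to coincide:

h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.at[start, asset] = True   # label-based scalar setter
print(time.time() - h)

h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.iat[start, asset] = True  # position-based scalar setter
print(time.time() - h)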
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 32
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.8.0
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: 0.8.1
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.1.8
Issue Analytics
- Created 4 years ago
- Comments: 8 (5 by maintainers)
It looks like this has been fixed on master; using the full 3400 * 5000 input I get: … And for the 500 * 500 input on master I get: … Using 0.25.3 I can't get 3400 * 5000 to run in a reasonable time, and for the 500 * 500 input I get: …
Not sure which commit fixes the performance issue; nothing stands out after looking at the merged PRs with the “Performance” label and 1.0 milestone. Could maybe add an asv benchmark for this if one doesn't exist (a rough sketch is below), but otherwise it looks like this is already fixed for the upcoming 1.0 release.
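For reference, a rough sketch of what such an asv benchmark could look like; the class and method names here are hypothetical, and it simply follows the usual asv setup/time_* convention rather than any existing pandas benchmark:

import numpy as np
import pandas as pd

class ScalarSetitemLoop:
    # Repeated single-cell assignment on a 500 x 500 boolean frame,
    # mirroring the loops from the report above.
    def setup(self):
        self.n = 500
        self.df = pd.DataFrame(False, index=range(self.n), columns=range(self.n))
        self.rows = np.random.randint(0, self.n, size=self.n)

    def time_loc_scalar_setitem(self):
        for col, row in enumerate(self.rows):
            self.df.loc[row, col] = True

    def time_iloc_scalar_setitem(self):
        for col, row in enumerate(self.rows):
            self.df.iloc[row, col] = True

    def time_iat_scalar_setitem(self):
        for col, row in enumerate(self.rows):
            self.df.iat[row, col] = True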
Hi - could you please include the output of pd.show_versions() as per above? You can also remove the template boilerplate.