performance of loc and iloc

import pandas as pd 
import numpy as np
import time

# print(pd.show_versions())

# create the columns and rows of the df
assets = np.array(range(3400))
dates = np.array(range(5000))
df_bool = pd.DataFrame(False, index=dates, columns=assets)

# create the array from the df
a = df_bool.to_numpy()

valid_since = {asset: np.random.choice(dates, size=1)[0] for asset in assets}

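# 1) Chained label-based indexing: select the row with .loc[start],
#    then assign into the result with .loc[asset].
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.loc[start].loc[asset] = True
print(time.time() - h)

# 2) Chained position-based indexing, same pattern with .iloc.
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.iloc[start].iloc[asset] = True
print(time.time() - h)

# 3) Single .loc call with a (row label, column label) pair.
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.loc[start, asset] = True
print(time.time() - h)

# 4) Single .iloc call with a (row position, column position) pair.
h = time.time()
for asset in assets:
    start = valid_since[asset]
    df_bool.iloc[start, asset] = True
print(time.time() - h)

# 5) Baseline: write directly into the NumPy array.
h = time.time()
for asset in assets:
    start = valid_since[asset]
    a[start, asset] = True
print(time.time() - h)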

Problem description

Hi, happy new year xD. I have a performance problem with loc and iloc when the indexer contains a comma. For instance, in the example code df.iloc[1000, 1000] takes more time than df.iloc[1000].iloc[1000], and I don’t understand why this happens. Why is access through df.loc[1000, 1000] so slow? The code spends a lot of time in this part. Note: everything in the code uses direct access. It does no slicing and never takes multiple assets or dates at the same time; it only needs to update one cell at a time.
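For single-cell writes like the ones described here, pandas also provides the scalar accessors `DataFrame.at` (label-based) and `DataFrame.iat` (position-based). The sketch below is not from the issue thread; it only shows the same one-cell-at-a-time update written with those accessors, on a deliberately small frame whose sizes are made up for illustration. No timing claim is made for any pandas version.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 5000 x 3400 frame in the report (sizes are illustrative).
assets = np.arange(50)
dates = np.arange(80)
df_bool = pd.DataFrame(False, index=dates, columns=assets)

rng = np.random.default_rng(0)
valid_since = {asset: rng.choice(dates) for asset in assets}

# Label-based scalar write: one (row label, column label) pair per call.
for asset in assets:
    df_bool.at[valid_since[asset], asset] = True

# Position-based scalar write on a second frame. Labels and positions
# coincide here only because both axes are 0..n-1 ranges.
df_pos = pd.DataFrame(False, index=dates, columns=assets)
for asset in assets:
    df_pos.iat[valid_since[asset], asset] = True

# Each column received exactly one True.
cells_set = int(df_bool.to_numpy().sum())
```

Whether this is faster than `loc`/`iloc` on a given pandas release would have to be measured with the script above.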

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 32
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.17.0
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.8.0
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: 0.8.1
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.1.8
None

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
jschendel commented, Dec 31, 2019

It looks like this has been fixed on master; using the full 3400 * 5000 input I get:

0.4405019283294678
0.39464664459228516
0.3829975128173828
0.3791060447692871
0.001157522201538086

And for the 500 * 500 input on master I get:

0.06933188438415527
0.06213188171386719
0.055167436599731445
0.053802490234375
0.00021409988403320312

Using 0.25.3 I can’t get 3400 * 5000 to run in a reasonable time, and for the 500 * 500 input I get:

0.10065627098083496
0.09464192390441895
2.234353542327881
2.2240450382232666
0.0002143383026123047
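To make the gap between the two sets of 500 * 500 timings easier to read, the ratios can be computed directly. This is only arithmetic on the numbers quoted above; it assumes the five values in each block appear in the same order as the five `print` calls in the script, and the labels are added here for illustration.

```python
# Timings in seconds, copied from the comment (500 * 500 input).
# Assumed order: the five print calls in the original script.
labels = [
    "df.loc[row].loc[col]",
    "df.iloc[row].iloc[col]",
    "df.loc[row, col]",
    "df.iloc[row, col]",
    "numpy a[row, col]",
]
master = [
    0.06933188438415527,
    0.06213188171386719,
    0.055167436599731445,
    0.053802490234375,
    0.00021409988403320312,
]
v0_25_3 = [
    0.10065627098083496,
    0.09464192390441895,
    2.234353542327881,
    2.2240450382232666,
    0.0002143383026123047,
]

# Ratio > 1 means 0.25.3 was slower than master for that access pattern.
slowdown = {
    label: old / new for label, old, new in zip(labels, v0_25_3, master)
}

for label, ratio in slowdown.items():
    print(f"{label:<24} {ratio:6.1f}x")
```

Under that ordering assumption, the two single-call forms with a comma are roughly 40 times slower on 0.25.3 than on master, while the chained forms and the NumPy baseline differ far less.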

Not sure which commit fixes the performance issue; nothing stands out after looking at the merged PRs with the “Performance” label and 1.0 milestone. Could maybe add an asv benchmark for this if one doesn’t exist but otherwise looks like this is already fixed for the upcoming 1.0 release.
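A minimal sketch of what such an asv benchmark could look like follows. The class name, method names, and sizes are hypothetical and do not come from the pandas benchmark suite; the sketch relies only on the asv convention that `setup` runs before the timed methods and that methods whose names start with `time_` are timed.

```python
import numpy as np
import pandas as pd


class ScalarSetitem:
    """Hypothetical asv-style benchmark for single-cell writes."""

    def setup(self):
        # 500 x 500 frame, matching the smaller input used in the comment.
        n_rows, n_cols = 500, 500
        self.df = pd.DataFrame(
            False, index=np.arange(n_rows), columns=np.arange(n_cols)
        )
        rng = np.random.default_rng(0)
        # One random row per column, as in the original script.
        self.rows = rng.integers(0, n_rows, size=n_cols)
        self.cols = np.arange(n_cols)

    def time_loc_scalar(self):
        # Label-based write with a (row, column) pair.
        for row, col in zip(self.rows, self.cols):
            self.df.loc[row, col] = True

    def time_iloc_scalar(self):
        # Position-based write with a (row, column) pair.
        for row, col in zip(self.rows, self.cols):
            self.df.iloc[row, col] = True
```

Outside of asv the class can be exercised by hand: create an instance, call `setup()`, then call one of the `time_` methods.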

1 reaction
alimcmaster1 commented, Dec 31, 2019

Hi - please could you include pd.show_versions() as per above? You can also remove the template boilerplate.


