PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64
Re-indexing a series with a two-level MultiIndex where the first level is datetime64 and the second level is int is roughly 60x slower in 0.23.4 than in 0.22.0 (1.98 seconds vs 0.03 seconds). Output first, then repro code below. The issue persists if you change the first level to int instead of datetime, but the perf difference is smaller (0.40 seconds vs 0.03 seconds).
"""
pandas version: 0.23.4
reindex took 1.9770500659942627 seconds
pandas version: 0.22.0
reindex took 0.0306899547577 seconds
"""
import pandas as pd
import time
import numpy as np
if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])
    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()
    s2 = s.dropna()
    start = time.time()
    s2.reindex(index=s.index)
    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))
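The datetime64 and int variants described above can be compared in one run. The following is a minimal sketch, not part of the original report: the helper name `time_reindex` and its parameters are hypothetical, a seeded `RandomState` is assumed so runs are repeatable, and `timeit.default_timer` is swapped in for `time.time` as the clock. Absolute timings will depend on the pandas version and the machine.

```python
from timeit import default_timer

import numpy as np
import pandas as pd


def time_reindex(first_level, n_minor=1440, seed=0):
    """Time reindexing a NaN-dropped series back onto its full MultiIndex.

    `time_reindex` is a hypothetical helper for illustration only.
    """
    rng = np.random.RandomState(seed)
    mi = pd.MultiIndex.from_product([first_level, range(n_minor)])
    v = rng.randn(len(mi))
    v[rng.rand(len(v)) < 0.3] = np.nan  # drop roughly 30% of the values
    s = pd.Series(v, index=mi).sort_index()
    s2 = s.dropna()

    start = default_timer()
    result = s2.reindex(index=s.index)
    elapsed = default_timer() - start

    # Sanity check: reindexing restores the original length.
    assert len(result) == len(s)
    return elapsed


if __name__ == "__main__":
    n_days = 300
    print("pandas version: %s" % pd.__version__)
    dt_level = pd.date_range(end="20181118", periods=n_days)
    print("datetime64 first level: %.4f seconds" % time_reindex(dt_level))
    print("int first level: %.4f seconds" % time_reindex(range(n_days)))
```

Running the same script under each pandas version gives one pair of numbers per version, which makes it easier to see how much of the slowdown is specific to the datetime64 level.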
Issue Analytics
- State:
- Created 5 years ago
- Reactions:3
- Comments:17 (8 by maintainers)
Top GitHub Comments
@toobaz I did some further investigation and found that _extract_level_codes is the other major cause of the performance regression. A minimal example:
The result:
So _extract_level_codes takes almost the same amount of time as get_indexer_for! To be more specific, it is the following line which is causing the performance regression: https://github.com/pandas-dev/pandas/blob/fd67546153ac6a5685d1c7c4d8582ed1a4c9120f/pandas/_libs/index.pyx#L602
So the conclusion is that _extract_level_codes accounts for the other 10x speed difference.
This issue still persists with the latest 1.2.3 version, and reindexing seems to have become even slower.
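The commenter's own minimal example and its output did not survive in this copy of the thread, and the sketch below is not a reconstruction of it. It is a hedged illustration, using only public API, of how the indexer lookup could be timed separately from the full reindex. The helper name `split_timings` and its parameters are hypothetical. Because the second measurement runs after the first on the same index object, it may benefit from an already-built index engine, so treat the two figures as rough indications only.

```python
from timeit import default_timer

import numpy as np
import pandas as pd


def split_timings(n_major=300, n_minor=1440, seed=0):
    """Time MultiIndex.get_indexer_for and Series.reindex separately.

    `split_timings` is a hypothetical helper for illustration only.
    """
    rng = np.random.RandomState(seed)
    full = pd.MultiIndex.from_product(
        [pd.date_range(end="20181118", periods=n_major), range(n_minor)]
    )
    keep = rng.rand(len(full)) >= 0.3  # keep roughly 70% of the labels
    partial = full[keep]
    s2 = pd.Series(rng.randn(len(partial)), index=partial)

    # Step 1: only the label lookup (positions of `full` labels in `partial`).
    start = default_timer()
    indexer = partial.get_indexer_for(full)
    t_indexer = default_timer() - start

    # Step 2: the full reindex, which includes the lookup plus taking values.
    start = default_timer()
    out = s2.reindex(index=full)
    t_reindex = default_timer() - start

    # -1 in the indexer marks labels missing from `partial`; those become NaN.
    assert (indexer == -1).sum() == out.isna().sum()
    return t_indexer, t_reindex


if __name__ == "__main__":
    t_indexer, t_reindex = split_timings()
    print("pandas version: %s" % pd.__version__)
    print("get_indexer_for took %.4f seconds" % t_indexer)
    print("reindex took %.4f seconds" % t_reindex)
```

If the lookup time is close to the reindex time, most of the cost lies in computing the indexer, which is where the comment above places the regression.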