PERF: regression in reindex. Pandas 0.23.4 is 60x slower than 0.22.0 with a MultiIndex with datetime64

Re-indexing a Series with a two-level MultiIndex, where the first level is datetime64 and the second level is int, is roughly 60x slower in 0.23.4 than in 0.22.0 (about 1.98 seconds vs 0.03 seconds in the run below). The output comes first, followed by the repro code. The issue persists if you change the first level to int instead of datetime, but the performance difference is smaller (0.40 seconds vs 0.03 seconds).


"""
pandas version: 0.23.4
reindex took 1.9770500659942627 seconds

pandas version: 0.22.0
reindex took 0.0306899547577 seconds
"""


import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 300
    dr = pd.date_range(end="20181118", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()

    s2.reindex(index=s.index)

    end = time.time()
    print("pandas version: %s" % pd.__version__)
    print("reindex took %s seconds" % (end - start))
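
A single time.time() measurement can be noisy, and an index may cache internal state after its first lookup. The sketch below is one hedged way to repeat the measurement: it rebuilds the Series on every round so each timing starts from a fresh index. The helper name time_reindex and its parameters (n_days, minutes, repeats, seed) are assumptions made for illustration and are not part of the original report. The default sizes are smaller than in the repro above so that it runs quickly.

```python
import time

import numpy as np
import pandas as pd


def time_reindex(n_days=30, minutes=1440, repeats=3, seed=0):
    """Return wall-clock timings (seconds) of Series.reindex on fresh indexes.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = np.random.default_rng(seed)
    timings = []
    for _ in range(repeats):
        # Rebuild everything each round so cached index state is not reused.
        dr = pd.date_range(end="20181118", periods=n_days)
        mi = pd.MultiIndex.from_product([dr, range(minutes)])
        v = rng.standard_normal(len(mi))
        v[rng.random(len(v)) < 0.3] = np.nan
        s = pd.Series(v, index=mi).sort_index()
        s2 = s.dropna()

        start = time.perf_counter()
        result = s2.reindex(index=s.index)
        timings.append(time.perf_counter() - start)

        # Sanity check: reindexing onto the full index restores its length.
        assert len(result) == len(s)
    return timings


if __name__ == "__main__":
    print("pandas version: %s" % pd.__version__)
    print("reindex timings (s):", time_reindex())
```

Comparing the minimum of the returned timings across pandas versions should give a steadier ratio than a single run, though absolute numbers will depend on the machine.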

Top GitHub Comments

1 reaction

qiuwei commented, Apr 22, 2021

@toobaz I did some further investigation and found that _extract_level_codes is the other major cause of the performance regression.

A minimal example:

import pandas as pd
import time
import numpy as np


if __name__ == '__main__':
    n_days = 2500
    dr = pd.date_range(end="20120101", periods=n_days)
    mi = pd.MultiIndex.from_product([dr, range(1440)])

    v = np.random.randn(len(mi))
    mask = np.random.rand(len(v)) < .3
    v[mask] = np.nan
    s = pd.Series(v, index=mi)
    s = s.sort_index()

    s2 = s.dropna()

    start = time.time()


    match_seq = s2.index.get_indexer_for(s.index)
    # result = s2.index.values
    # resul2 = s.index.values

    end1 = time.time()
    match_seq = s2.index.get_indexer_for(s.index)
    end2 = time.time()

    s2.index._engine._extract_level_codes(s.index)
    end3 = time.time()
    # print(s2)
    print("pandas version: %s" % pd.__version__)
    print("reindex for the first time(include time cost of populating the hash mapping) took %s seconds" % (end1 - start))
    print("reindex with mapping populated took %s seconds" % (end2 - end1))
    print("extract level codes takes %s seconds" % (end3 - end2))

The result:

pandas version: 1.1.4
reindex for the first time(include time cost of populating the hash mapping) took 5.976720809936523 seconds
reindex with mapping populated took 3.858426332473755 seconds
extract level codes takes 3.790076732635498 seconds

So _extract_level_codes takes almost as long as get_indexer_for (3.79 seconds vs 3.86 seconds)! More specifically, the following line is what causes the performance regression: https://github.com/pandas-dev/pandas/blob/fd67546153ac6a5685d1c7c4d8582ed1a4c9120f/pandas/_libs/index.pyx#L602

So the conclusion is:

Conversion from datetime to object contributes a 6x speed difference (this could easily be fixed if ed15d8e were reverted).
Inefficiency of _extract_level_codes contributes the other 10x speed difference.
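
To give a rough feel for the first point, the sketch below compares reading a datetime level as its underlying int64 values with boxing every element into a Python object. It is only an illustration of why a datetime-to-object conversion can be expensive, under the assumption that boxing is the dominant cost. It does not reproduce the internal pandas code path referenced above, and the helper name conversion_timings is hypothetical.

```python
import time

import pandas as pd


def conversion_timings(n_days=50, minutes=1440):
    """Time reading a datetime level as int64 versus as boxed objects.

    Hypothetical helper for illustration; not part of pandas.
    """
    dr = pd.date_range(end="20120101", periods=n_days)
    level = dr.repeat(minutes)  # one datetime64 value per row

    start = time.perf_counter()
    as_int = level.asi8  # underlying int64 nanosecond values
    mid = time.perf_counter()
    as_obj = level.astype(object)  # one pd.Timestamp object per element
    end = time.perf_counter()

    timings = {"int64": mid - start, "object": end - mid}
    return timings, as_int, as_obj


if __name__ == "__main__":
    timings, _, _ = conversion_timings()
    print("pandas version: %s" % pd.__version__)
    print("int64 view took %s seconds" % timings["int64"])
    print("object conversion took %s seconds" % timings["object"])
```

The object conversion allocates one Python object per row, so its cost grows with the length of the index, whereas the int64 access reuses the existing buffer.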

1 reaction

qiuwei commented, Mar 10, 2021

This issue persists in the latest version, 1.2.3, and reindexing seems to have become even slower.

pandas version: 1.2.3
reindex took 2.6638526916503906 seconds
