loc very slow on sorted, non-unique index with list of labels as argument

In [1]: import pandas, numpy

In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))

In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop

In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop

… makes sense to me.

In [5]: df.index = list(range(99999)) + [55555]

In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop

In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop

Non-unique index, slower (the second call probably has to scan the whole index): still makes sense to me. Sorting should improve things…

In [8]: df.sort(inplace=True)

In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop

In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop

… here I’m lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,

In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]: 
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True
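Once the index is sorted, all rows sharing a label sit in one contiguous block, so a binary search can locate them without scanning the whole index. The sketch below assumes a sorted index and relies on `Index.searchsorted`. `rows_for_label` is a hypothetical helper written for illustration: it is not a pandas function and is not a description of what `.loc` does internally. It uses `sort_index`, the current spelling for sorting by index, and a smaller frame than the one above.

```python
import numpy as np
import pandas as pd

# Small frame with one duplicated label (555), sorted by index.
df = pd.DataFrame(np.random.random((1000, 4)))
df.index = list(range(999)) + [555]
df = df.sort_index()


def rows_for_label(frame, label):
    """Hypothetical helper: return all rows for `label` as a DataFrame.

    Assumes frame.index is sorted in increasing order.
    """
    idx = frame.index
    if not idx.is_monotonic_increasing:
        raise ValueError("index must be sorted in increasing order")
    # Binary search for the first and one-past-last position of the label.
    lo = idx.searchsorted(label, side="left")
    hi = idx.searchsorted(label, side="right")
    # Positional slice: always a DataFrame, empty if the label is absent.
    return frame.iloc[lo:hi]


print(rows_for_label(df, 555))
```

A positional slice always yields a DataFrame, so this also sidesteps the Series-or-DataFrame ambiguity; a missing label gives an empty frame rather than a `KeyError`.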

(As a sidenote: the reason I’m making calls such as df.loc[[a_label]] is that df.loc[a_label] sometimes returns a Series and sometimes a DataFrame. I currently work around this with df.loc[df.index == a_label], which is ~3x slower than df.loc[a_label] but much faster than the df.loc[[a_label]] above.)
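The return-type difference described in the sidenote can be reproduced on a small frame. This is a minimal sketch with made-up sizes and labels (1000 rows, label 555 duplicated), not the data from the issue.

```python
import numpy as np
import pandas as pd

# Non-unique index: label 555 appears twice, every other label once.
df = pd.DataFrame(np.random.random((1000, 4)))
df.index = list(range(999)) + [555]
df = df.sort_index()

single = df.loc[3]      # label occurs once
dup = df.loc[555]       # label occurs twice

# Boolean-mask workaround from the sidenote.
mask_single = df.loc[df.index == 3]
mask_dup = df.loc[df.index == 555]

print(type(single).__name__, type(dup).__name__)
print(type(mask_single).__name__, type(mask_dup).__name__)
print(mask_single.shape, mask_dup.shape)
```

With the scalar label the result type depends on how many rows match, while the boolean mask gives a DataFrame in both cases.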

Top GitHub Comments

jreback commented, Feb 11, 2015

A multi-index is like an index of indexes, so if each is unique it uses the optimized lookups. FYI, the difference between 1ms and 100us is just a few function calls (e.g. the MI has to do more inference on what exactly you are looking for).

In [20]: %timeit df3.loc[[(55555,99999)]]
1000 loops, best of 3: 417 us per loop

In [21]: %timeit df3.loc[[(55555,99999),(99998,99998)]]
1000 loops, best of 3: 432 us per loop

toobaz commented, Feb 11, 2015

Thanks for the clarification. Indeed, I wouldn’t have expected even such a stupid multi-level index…

In [13]: df3 = df.reset_index().reset_index().set_index(["level_0", "index"])

to be substantially faster than a non-unique one!

In [14]: %timeit df3.loc[[55555]]
100 loops, best of 3: 3.23 ms per loop

(although still orders of magnitude slower than df.loc[df.index == 55555])
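The uniqueness that separates the two cases can be checked directly with `Index.is_unique`. The sketch below rebuilds a `df3`-style two-level index on a small frame. The sizes and the duplicated label are made up, and it says nothing about the timings quoted above.

```python
import numpy as np
import pandas as pd

# Sorted, non-unique single-level index (label 555 appears twice).
df = pd.DataFrame(np.random.random((1000, 4)))
df.index = list(range(999)) + [555]
df = df.sort_index()

# Same construction as df3 above: the second reset_index() adds a
# "level_0" column holding the row position, which is unique.
df3 = df.reset_index().reset_index().set_index(["level_0", "index"])

print(df.index.is_unique)    # single-level index with a duplicate label
print(df3.index.is_unique)   # (position, label) pairs
print(df3.index.nlevels)
```

Because the first level is just the row position, every (position, label) pair is distinct even though the labels alone are not.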

By the way: I know df.loc[indexer] will return a DataFrame if you have duplicates. But I would find it more elegant/useful if the distinction were made at the DataFrame level (i.e. if not self.index.is_unique, then a DataFrame is returned even for non-duplicated labels). I may well be overlooking tons of feasibility/backward compatibility issues, however.
