loc very slow on sorted, non-unique index with list of labels as argument

In [1]: import pandas, numpy

In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))

In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop

In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop

… makes sense to me.

In [5]: df.index = list(range(99999)) + [55555]

In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop

In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop

Non-unique index, slower (the second call probably has to scan all the index): still makes sense to me. Sorting should improve things…
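The index properties that this discussion turns on (unique or not, sorted or not) can be inspected directly. The following is a minimal sketch, assuming a recent pandas release: it uses a smaller frame than the session and calls `sort_index` rather than `df.sort`, since `DataFrame.sort` was removed in later pandas versions. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the frame in the session (1000 rows instead of 100000).
df = pd.DataFrame(np.random.random((1000, 4)))

# Introduce one duplicated label at the end, as in the session.
df.index = list(range(999)) + [555]

unique_before = df.index.is_unique                  # duplicate label present
sorted_before = df.index.is_monotonic_increasing    # trailing 555 breaks the order

# Sorting by index restores monotonicity but not uniqueness.
df_sorted = df.sort_index()
unique_after = df_sorted.index.is_unique
sorted_after = df_sorted.index.is_monotonic_increasing

print(unique_before, sorted_before, unique_after, sorted_after)
```

Note that `is_monotonic_increasing` is a non-strict check, so a sorted index with repeated labels still counts as monotonic.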

In [8]: df.sort(inplace=True)

In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop

In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop

… here I’m lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,

In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]: 
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True

(As a side note: the reason I’m making calls such as df.loc[[a_label]] is that df.loc[a_label] sometimes returns a Series and sometimes a DataFrame. I currently work around this with df.loc[df.index == a_label], which is ~3x slower than df.loc[a_label] but much faster than the above df.loc[[a_label]].)
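The return-type inconsistency described in the side note can be reproduced in a few lines. This is only a sketch on a small frame, assuming a recent pandas release; the labels 7 and 555 are arbitrary choices for illustration, and no timings are claimed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1000, 4)))
df.index = list(range(999)) + [555]   # label 555 appears twice
df = df.sort_index()

single = df.loc[7]                # label occurs once: a Series
dup = df.loc[555]                 # label occurs twice: a DataFrame
as_list = df.loc[[7]]             # list of labels: always a DataFrame
masked = df.loc[df.index == 7]    # boolean mask: always a DataFrame

print(type(single).__name__, type(dup).__name__)
print(as_list.shape, masked.shape)
```

Both the list form and the boolean-mask form give a one-row DataFrame for a label that occurs once, which is why the mask can stand in for the slower list lookup.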

Issue Analytics

  • State: closed
  • Created 9 years ago
  • Reactions: 2
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
jreback commented, Feb 11, 2015

A multi-index is like an index of indexes, so if each is unique it uses the optimized lookups. FYI, the difference between 1 ms and 100 µs is just a few function calls (e.g. the MI has to do more inference on what exactly you are looking for).

In [20]: %timeit df3.loc[[(55555,99999)]]
1000 loops, best of 3: 417 us per loop

In [21]: %timeit df3.loc[[(55555,99999),(99998,99998)]]
1000 loops, best of 3: 432 us per loop

1 reaction
toobaz commented, Feb 11, 2015

Thanks for the clarification. Indeed, I wouldn’t have expected even such a stupid multi-level index…

In  [13]: df3 = df.reset_index().reset_index().set_index(["level_0", "index"])

to be substantially faster than a non-unique one!

In  [14]: %timeit df3.loc[[55555]]
Out [14]: 100 loops, best of 3: 3.23 ms per loop

(although still orders of magnitude slower than df.loc[df.index == 55555])

By the way: I know df.loc[indexer] will return a DataFrame if you have duplicates. But I would find it more elegant/useful if the distinction was made at the DataFrame level (i.e. if not self.index.is_unique, then a DataFrame is returned even for non-duplicated labels). I may certainly be overlooking tons of feasibility/backward compatibility issues however.
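Until something like that exists in pandas itself, the "always a DataFrame" behaviour can be approximated in user code. The sketch below is one possible approach, not anything proposed in the thread: `loc_frame` is a hypothetical helper name, and it relies on `Index.get_loc`, which returns an integer, a slice, or a boolean mask depending on the index. Whether it is faster than df.loc[df.index == a_label] on a given dataset is not measured here.

```python
import numpy as np
import pandas as pd


def loc_frame(df, label):
    """Hypothetical helper: return the rows for one label, always as a DataFrame."""
    # get_loc gives an int for a single match, a slice for repeated labels
    # in a sorted index, or a boolean mask for repeated labels otherwise.
    loc = df.index.get_loc(label)
    if isinstance(loc, (int, np.integer)):
        loc = [loc]  # wrap so that iloc returns a one-row DataFrame
    return df.iloc[loc]


df = pd.DataFrame(np.random.random((1000, 4)))
df.index = list(range(999)) + [555]   # non-unique, unsorted
df_sorted = df.sort_index()           # non-unique, sorted

once = loc_frame(df_sorted, 7)
twice = loc_frame(df_sorted, 555)
twice_unsorted = loc_frame(df, 555)

print(once.shape, twice.shape, twice_unsorted.shape)
```

A label that is absent from the index raises KeyError here, whereas the boolean-mask form returns an empty frame; which behaviour is preferable depends on the caller.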
