loc very slow on sorted, non-unique index with list of labels as argument
In [1]: import pandas, numpy
In [2]: df = pandas.DataFrame(numpy.random.random((100000, 4)))
In [3]: %timeit df.loc[55555]
10000 loops, best of 3: 118 µs per loop
In [4]: %timeit df.loc[[55555]]
1000 loops, best of 3: 324 µs per loop
… makes sense to me.
In [5]: df.index = list(range(99999)) + [55555]
In [6]: %timeit df.loc[55555]
100 loops, best of 3: 4.04 ms per loop
In [7]: %timeit df.loc[[55555]]
100 loops, best of 3: 16.8 ms per loop
Non-unique index, slower (the second call probably has to scan all the index): still makes sense to me. Sorting should improve things…
In [8]: df.sort(inplace=True)
In [9]: %timeit df.loc[55555]
1000 loops, best of 3: 239 µs per loop
In [10]: %timeit df.loc[[55555]]
100 loops, best of 3: 17.2 ms per loop
… here I’m lost: why this huge difference? The difference is even larger (3 orders of magnitude) in a real database I am working on. Clearly,
In [12]: df.loc[[55555]] == df.loc[55555]
Out[12]:
          0     1     2     3
55555  True  True  True  True
55555  True  True  True  True
(As a sidenote: the reason why I’m doing calls such as df.loc[[a_label]] is that df.loc[a_label] sometimes returns a Series and sometimes a DataFrame, depending on whether the label is duplicated. I currently solve this by using df.loc[df.index == a_label], which is however ~3x slower than df.loc[a_label], but much faster than the above df.loc[[a_label]].)
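For illustration, here is a minimal sketch of how the sorted case could be exploited by hand, assuming a recent pandas version. The helper name rows_for_label is hypothetical and not part of pandas. It binary-searches a monotonically increasing index with Index.searchsorted and falls back to the boolean-mask workaround otherwise. It always returns a DataFrame, and unlike .loc it returns an empty frame instead of raising KeyError when the label is absent.

```python
import numpy as np
import pandas as pd


def rows_for_label(df, label):
    """Hypothetical helper: return all rows matching `label` as a DataFrame."""
    index = df.index
    if index.is_monotonic_increasing:
        # Binary search for the first and one-past-last matching positions.
        left = int(index.searchsorted(label, side="left"))
        right = int(index.searchsorted(label, side="right"))
        return df.iloc[left:right]
    # Unsorted index: fall back to a full scan with a boolean mask.
    return df.loc[index == label]


# Small example with a sorted, non-unique index (illustrative data only).
example = pd.DataFrame(np.arange(12).reshape(6, 2), index=[0, 1, 2, 2, 3, 5])
duplicated_rows = rows_for_label(example, 2)   # two rows
single_row = rows_for_label(example, 3)        # one row, still a DataFrame
```

Whether this is actually faster than df.loc[[a_label]] on a given pandas version would have to be checked with %timeit; the sketch only shows that a sorted index makes a scan unnecessary in principle.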
Issue Analytics
- State:
- Created 9 years ago
- Reactions:2
- Comments:6 (6 by maintainers)
A multi-index is like an index of indexes, so if each level is unique it uses the optimized lookups. FYI, the difference between 1 ms and 100 µs is just a few function calls (e.g. the MI has to do more inference on what exactly you are looking for).
Thanks for the clarification. Indeed, I wouldn’t have expected even such a stupid multi-level index…
to be substantially faster than a non-unique one!
(although still orders of magnitude slower than df.loc[df.index == 55555])
By the way: I know df.loc[indexer] will return a DataFrame if you have duplicates. But I would find it more elegant/useful if the distinction were made at the DataFrame level (i.e. if not self.index.is_unique, then a DataFrame is returned even for non-duplicated labels). I may certainly be overlooking tons of feasibility/backward compatibility issues, however.
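To make the suggestion concrete, here is a sketch of that behaviour written as a standalone function rather than as a change to pandas itself. The name loc_by_index_uniqueness is hypothetical. It assumes a recent pandas, where Index.get_loc returns an integer, a slice, or a boolean mask depending on whether the label is duplicated and whether the index is sorted.

```python
import numpy as np
import pandas as pd


def loc_by_index_uniqueness(df, label):
    """Hypothetical helper: the return type depends only on index uniqueness.

    Unique index     -> Series (same as df.loc[label]).
    Non-unique index -> DataFrame, even if `label` itself occurs only once.
    """
    if df.index.is_unique:
        return df.loc[label]
    # get_loc raises KeyError for a missing label, like .loc does.
    loc = df.index.get_loc(label)
    if isinstance(loc, (int, np.integer)):
        # A single match: widen it to a one-row slice to keep a DataFrame.
        loc = slice(loc, loc + 1)
    # loc is now a slice or a boolean mask; both are valid for .iloc.
    return df.iloc[loc]


# Illustrative data: a non-unique index where label 10 occurs only once.
frame = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[10, 20, 20, 30])
one_row_frame = loc_by_index_uniqueness(frame, 10)
two_row_frame = loc_by_index_uniqueness(frame, 20)
```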