Performance regression with duplicate index 0.11.1->0.11.2
I am using `histplot` for simple histogram plots on medium-sized datasets. With seaborn 0.11.1 everything worked very nicely and fast. Upon upgrading to 0.11.2, the same call suddenly never finishes and I am stuck in plotting…
I have a plotting script looking like this:
plt.figure()
sns.histplot(data=df, x="AllocCores", binwidth=1, discrete=True, log_scale=(False, True))
The dataframe has 789412 entries, and `df["AllocCores"]` spans from a min of 1 to a max of 1200. So `binwidth=1` gives 1200 bins, which I agree is very large, but it is required in our analysis plots 😉
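For reference, a synthetic stand-in with the same shape as the data described above (the uniform distribution is an assumption; the real, sensitive data is not shared) reproduces the binning arithmetic:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe: ~790k rows of integer
# core counts spanning 1..1200, so binwidth=1 yields 1200 discrete bins.
rng = np.random.default_rng(0)
df = pd.DataFrame({"AllocCores": rng.integers(1, 1201, size=789_412)})

n_bins = int(df["AllocCores"].max() - df["AllocCores"].min() + 1)
print(n_bins)  # 1200
```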
Using 0.11.1 I get this:
$> time python script.py
seaborn version = 0.11.1
Starting to plot...
Choosing binwidth = 120
Choosing binwidth = 1070
Choosing binwidth = 16
Choosing binwidth = 16
python script.py 40.34s user 0.83s system 155% cpu 26.479 total
Note that this plots 16 files in less than a minute.
Using 0.11.2 I get this:
$> time python script.py
seaborn version = 0.11.2
Starting to plot...
It basically never ends; I have waited more than 10 minutes and it isn't even done with the first plot. Killing it with Ctrl+C gives this stack trace:
File "script.py", line 96, in <module>
plot_alloccores(df)
File ".../script_lib.py", line 98, in plot_alloccores
sns.histplot(data=df, x="AllocCores", binwidth=1, discrete=True, log_scale=(False, True))
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/distributions.py", line 1462, in histplot
p.plot_univariate_histogram(
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/distributions.py", line 428, in plot_univariate_histogram
for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/_core.py", line 983, in iter_data
data = self.comp_data
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/_core.py", line 1057, in comp_data
comp_col.loc[orig.index] = pd.to_numeric(axis.convert_units(orig))
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexing.py", line 671, in __setitem__
self._setitem_with_indexer(indexer, value)
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexing.py", line 1055, in _setitem_with_indexer
value = self._align_series(indexer, Series(value))
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexing.py", line 1173, in _align_series
ser = ser.reindex(obj.axes[0][indexer[0]], copy=True)._values
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/series.py", line 4030, in reindex
return super().reindex(index=index, **kwargs)
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/generic.py", line 4543, in reindex
return self._reindex_axes(
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/generic.py", line 4558, in _reindex_axes
new_index, indexer = ax.reindex(
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexes/base.py", line 3153, in reindex
indexer, missing = self.get_indexer_non_unique(target)
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexes/base.py", line 4486, in get_indexer_non_unique
indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
File "pandas/_libs/index.pyx", line 354, in pandas._libs.index.IndexEngine.get_indexer_non_unique
File "<__array_function__ internals>", line 5, in resize
File "/opt/gnu/9.3.0/python/packages/3.8.5/numpy/1.19.1/lib/python3.8/site-packages/numpy-1.19.1-py3.8-linux-x86_64.egg/numpy/core/fromnumeric.py", line 1417, in resize
a = concatenate((a,) * n_copies)
File "<__array_function__ internals>", line 5, in concatenate
KeyboardInterrupt
I hope this provides enough detail to find the regression. If not, and you require some kind of dataset, let me know and I'll try to cook something up. There is some sensitive data that I can't share, but I may be able to fiddle with the entries if that would help.
Issue Analytics
- Created 2 years ago
- Comments: 16 (6 by maintainers)
Top GitHub Comments
Dropping a search-engine signpost to help the next person who runs into this issue. After upgrading from seaborn 0.11.1 to 0.11.2, running `sns.histplot` on a dataframe with a non-unique index yields `ValueError: cannot reindex from a duplicate axis`. I was able to resolve this by running `df = df.reset_index(drop=True)`, which replaces the non-unique index with unique consecutive integers.

Thanks @ricardoV94, that's much more helpful.
I can reproduce your performance issue now, and I can track down its source: your dataframe has a (heavily) duplicated index. Resetting that index before plotting solves the problem.
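A minimal sketch of that diagnosis and fix (the `concat` construction here is just one assumed way of producing a duplicated index; the fix matches the `reset_index(drop=True)` workaround quoted above):

```python
import pandas as pd

# A concat without ignore_index is a common way to end up with a
# duplicated index: the labels 0 and 1 each appear twice here.
df = pd.concat([pd.DataFrame({"AllocCores": [1, 2]}),
                pd.DataFrame({"AllocCores": [3, 4]})])
assert not df.index.is_unique

# The fix: replace the duplicated labels with a fresh unique RangeIndex.
df = df.reset_index(drop=True)
print(list(df.index))  # [0, 1, 2, 3]
```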
https://github.com/mwaskom/seaborn/pull/2417 introduced the relevant change. Interestingly, the awkward `comp_col.loc[orig.index] = ...` construction visible in the traceback above, which IIRC is a relic of the initial implementation, is much slower (with a duplicate index) than a direct, position-based assignment.
That said, duplicate indices cause problems elsewhere too. Duplicate indices are, IMO, a rough edge in pandas: the index is supposed to act like a primary key in a database, and uniqueness really ought to be enforced. Pandas explicitly chooses not to enforce it by default, and while their reasoning makes sense, a consequence is that certain operations fail or (apparently) run many orders of magnitude slower with certain dataframes but not others.
Therefore, it’s strongly recommended for users (of pandas) to reset the index on a dataframe after performing an operation that introduces duplicates.
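To illustrate why duplicate labels hurt, here is a sketch of the same pattern as the `comp_col.loc[orig.index] = ...` line in the traceback (not seaborn's actual code; the data is contrived so that every label is identical):

```python
import numpy as np
import pandas as pd

# With n identical index labels, a label-aligned .loc assignment must
# match every duplicate label against every other occurrence, which
# blows up in time and memory; a positional copy skips alignment.
n = 1_000
orig = pd.Series(np.arange(n, dtype=float), index=np.zeros(n, dtype=int))
comp_col = pd.Series(index=orig.index, dtype=float)

# comp_col.loc[orig.index] = pd.to_numeric(orig)  # label-aligned: pathological on duplicates
comp_col.iloc[:] = pd.to_numeric(orig).to_numpy()  # positional: a plain O(n) copy
print(comp_col.iloc[-1])  # 999.0
```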
But ultimately, there needs to be a decision in seaborn about how to handle duplicate indices, and that should happen upstream of this code. The possibilities seem to be: a) raise with an informative error or b) internally reset the index if it has duplicates.
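Option (b) could look like this hypothetical helper (the name `ensure_unique_index` and its placement are assumptions for illustration, not seaborn API; option (a) would raise an informative error at the same check):

```python
import pandas as pd

def ensure_unique_index(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch of option (b): transparently replace a duplicated index
    # with a fresh RangeIndex before any label-aligned operations run.
    if not df.index.is_unique:
        return df.reset_index(drop=True)
    return df

df = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 0, 1])
out = ensure_unique_index(df)
print(out.index.is_unique)  # True
```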
So: