Performance regression with duplicate index 0.11.1->0.11.2
I am using `histplot` for simple histogram plots on medium-sized datasets. With seaborn 0.11.1 everything worked very nicely and fast. Upon upgrading to 0.11.2, the same call suddenly never finishes and I am stuck in plotting…
I have a plotting script looking like this:
plt.figure()
sns.histplot(data=df, x="AllocCores", binwidth=1, discrete=True, log_scale=(False, True))
The dataframe has 789412 entries, and `df["AllocCores"]` spans from a min of 1 to a max of 1200. So `binwidth=1` gives 1200 bins, which I agree is very large, but it is required in our analysis plots 😉
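For reference, a synthetic stand-in with the same shape as the data described above (the uniform distribution is an assumption; the real, sensitive data is not shared) reproduces the binning arithmetic:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe: ~790k rows of integer
# core counts spanning 1..1200, so binwidth=1 yields 1200 discrete bins.
rng = np.random.default_rng(0)
df = pd.DataFrame({"AllocCores": rng.integers(1, 1201, size=789_412)})

n_bins = int(df["AllocCores"].max() - df["AllocCores"].min() + 1)
print(n_bins)  # 1200
```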
Using 0.11.1 I get this:
$> time python script.py
seaborn version = 0.11.1
Starting to plot...
Choosing binwidth = 120
Choosing binwidth = 1070
Choosing binwidth = 16
Choosing binwidth = 16
python script.py 40.34s user 0.83s system 155% cpu 26.479 total
Note that this plots 16 files in less than a minute.
Using 0.11.2 I get this:
$> time python script.py
seaborn version = 0.11.2
Starting to plot...
It basically never ends; I have waited more than 10 minutes and it isn't even done with the first plot. Killing it with Ctrl+C gives this stack trace:
File "script.py", line 96, in <module>
plot_alloccores(df)
File ".../script_lib.py", line 98, in plot_alloccores
sns.histplot(data=df, x="AllocCores", binwidth=1, discrete=True, log_scale=(False, True))
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/distributions.py", line 1462, in histplot
p.plot_univariate_histogram(
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/distributions.py", line 428, in plot_univariate_histogram
for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/_core.py", line 983, in iter_data
data = self.comp_data
File "/opt/gnu/9.3.0/python/3.8.5/lib/python3.8/site-packages/seaborn/_core.py", line 1057, in comp_data
comp_col.loc[orig.index] = pd.to_numeric(axis.convert_units(orig))
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexing.py", line 671, in __setitem__
self._setitem_with_indexer(indexer, value)
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexing.py", line 1055, in _setitem_with_indexer
value = self._align_series(indexer, Series(value))
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexing.py", line 1173, in _align_series
ser = ser.reindex(obj.axes[0][indexer[0]], copy=True)._values
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/series.py", line 4030, in reindex
return super().reindex(index=index, **kwargs)
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/generic.py", line 4543, in reindex
return self._reindex_axes(
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/generic.py", line 4558, in _reindex_axes
new_index, indexer = ax.reindex(
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexes/base.py", line 3153, in reindex
indexer, missing = self.get_indexer_non_unique(target)
File "/opt/gnu/9.3.0/python/packages/3.8.5/pandas/1.0.5/lib/python3.8/site-packages/pandas-1.0.5-py3.8-linux-x86_64.egg/pandas/core/indexes/base.py", line 4486, in get_indexer_non_unique
indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
File "pandas/_libs/index.pyx", line 354, in pandas._libs.index.IndexEngine.get_indexer_non_unique
File "<__array_function__ internals>", line 5, in resize
File "/opt/gnu/9.3.0/python/packages/3.8.5/numpy/1.19.1/lib/python3.8/site-packages/numpy-1.19.1-py3.8-linux-x86_64.egg/numpy/core/fromnumeric.py", line 1417, in resize
a = concatenate((a,) * n_copies)
File "<__array_function__ internals>", line 5, in concatenate
KeyboardInterrupt
I hope this provides enough detail to find the regression. If not, and you require some kind of dataset, let me know and I'll try to cook something up. There is some sensitive data that I can't share, but I may be able to fiddle with the entries if that would help.
Issue Analytics
- Created 2 years ago
- Comments: 16 (6 by maintainers)
Top GitHub Comments
Dropping a search-engine signpost to help the next person who runs into this issue. After upgrading from seaborn 0.11.1 to 0.11.2, running `sns.histplot` on a dataframe with a non-unique index yields `ValueError: cannot reindex from a duplicate axis`. I was able to resolve this by running `df = df.reset_index(drop=True)`, which replaces the non-unique index with unique consecutive integers.

Thanks @ricardoV94, that's much more helpful.
I can reproduce your performance issue now, and I can track down its source: your dataframe has a (heavily) duplicated index. Resetting that index before plotting solves the problem.
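A minimal sketch of that diagnosis and fix (the `concat` construction here is just one assumed way of producing a duplicated index; the fix matches the `reset_index(drop=True)` workaround quoted above):

```python
import pandas as pd

# A concat without ignore_index is a common way to end up with a
# duplicated index: the labels 0 and 1 each appear twice here.
df = pd.concat([pd.DataFrame({"AllocCores": [1, 2]}),
                pd.DataFrame({"AllocCores": [3, 4]})])
assert not df.index.is_unique

# The fix: replace the duplicated labels with a fresh unique RangeIndex.
df = df.reset_index(drop=True)
print(list(df.index))  # [0, 1, 2, 3]
```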
https://github.com/mwaskom/seaborn/pull/2417 introduced the relevant change. Interestingly, the awkward `comp_col.loc[orig.index] = ...` construction visible in the traceback above, which IIRC is a relic of the initial implementation, is much slower (with a duplicate index) than a direct, position-based assignment.
That said, duplicate indices cause problems elsewhere too. Duplicate indices are, IMO, a rough edge in pandas: the index is supposed to act like a primary key in a database, and uniqueness really ought to be enforced. Pandas explicitly chooses not to enforce it by default, and while their reasoning makes sense, a consequence is that certain operations fail or (apparently) run many orders of magnitude slower with certain dataframes but not others.
Therefore, it’s strongly recommended for users (of pandas) to reset the index on a dataframe after performing an operation that introduces duplicates.
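To illustrate why duplicate labels hurt, here is a sketch of the same pattern as the `comp_col.loc[orig.index] = ...` line in the traceback (not seaborn's actual code; the data is contrived so that every label is identical):

```python
import numpy as np
import pandas as pd

# With n identical index labels, a label-aligned .loc assignment must
# match every duplicate label against every other occurrence, which
# blows up in time and memory; a positional copy skips alignment.
n = 1_000
orig = pd.Series(np.arange(n, dtype=float), index=np.zeros(n, dtype=int))
comp_col = pd.Series(index=orig.index, dtype=float)

# comp_col.loc[orig.index] = pd.to_numeric(orig)  # label-aligned: pathological on duplicates
comp_col.iloc[:] = pd.to_numeric(orig).to_numpy()  # positional: a plain O(n) copy
print(comp_col.iloc[-1])  # 999.0
```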
But ultimately, there needs to be a decision in seaborn about how to handle duplicate indices, and that should happen upstream of this code. The possibilities seem to be: a) raise with an informative error or b) internally reset the index if it has duplicates.
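Option (b) could look like this hypothetical helper (the name `ensure_unique_index` and its placement are assumptions for illustration, not seaborn API; option (a) would raise an informative error at the same check):

```python
import pandas as pd

def ensure_unique_index(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch of option (b): transparently replace a duplicated index
    # with a fresh RangeIndex before any label-aligned operations run.
    if not df.index.is_unique:
        return df.reset_index(drop=True)
    return df

df = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 0, 1])
out = ensure_unique_index(df)
print(out.index.is_unique)  # True
```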
So: