norm_hist and kde

See original GitHub issue

Hi all,

(first of all: awesome library, I love it)

I am wondering about the default behavior of distplot when norm_hist is False.

At least on 0.8.0,

sns.distplot(x, norm_hist=False)

produces a figure that 1) is normalized and 2) still has the KDE. This is a bit of a gotcha: unless you carefully read the docs for norm_hist and kde, you would not infer that kde defaults to True and that it can override norm_hist=False.

If you run:

sns.distplot(x, norm_hist=False, kde=False)

you get an unnormalized histogram without the KDE.

That is itself a little disappointing, since the KDE is really useful for understanding the structure of the data.

I can think of two potential ways to address this mild annoyance:

  1. make kde=None the default and infer from the value of norm_hist whether a KDE should be computed, or
  2. if norm_hist=False, compute the KDE as usual (normalized), but then multiply it by the integral of the unnormalized histogram before putting it on the plot. (I am not a statistician, so this seems fine to me, but perhaps it isn’t kosher for some reason?)
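
Option 2 can be sketched as follows. This is a minimal illustration and not seaborn's implementation: it assumes a plain Gaussian kernel with a hand-picked bandwidth, and the helper names (`gaussian_kde_density`, `count_scaled_kde`) are hypothetical. A density integrates to 1, while an unnormalized histogram with equal-width bins has an area of `n * bin_width`, so multiplying the density by that factor puts the curve on the count scale.

```python
import numpy as np


def gaussian_kde_density(samples, grid, bandwidth):
    """Plain Gaussian KDE evaluated on `grid`; integrates to roughly 1."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * bandwidth)


def count_scaled_kde(samples, grid, bandwidth, bin_width):
    """KDE rescaled to the units of an unnormalized (count) histogram."""
    density = gaussian_kde_density(samples, grid, bandwidth)
    # Area of a count histogram with equal-width bins is n * bin_width.
    return density * len(samples) * bin_width


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
grid = np.linspace(-6.0, 6.0, 1201)
bin_width = 0.25

curve = count_scaled_kde(x, grid, bandwidth=0.3, bin_width=bin_width)

# Trapezoidal integral of the rescaled curve; it should be close to
# len(x) * bin_width, the area of the matching count histogram.
area = np.sum((curve[1:] + curve[:-1]) / 2.0 * np.diff(grid))
```

The resulting `curve` could then be drawn with `plt.plot(grid, curve)` over a count histogram of `x` that uses the same bin width.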

I’d be open to doing this myself (especially option 2), as long as I know you’ll accept the PR 😅.

Cheers!

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

5 reactions
Abysselene commented, Feb 22, 2019

Hi. Actually, I’m quite fond of norm_hist and would rather see it evolve than disappear. As seen in #479, #1396 and #61, in certain situations it’s a problem not to be able to scale or “denormalize” a KDE in distplot.

Here is my situation: I plot two histograms on the same axes to see the differences. At first I used matplotlib’s hist with histtype='stepfilled' and a low alpha.

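import numpy as np
import matplotlib.pyplot as plt
nWidthBar = 20  # number of equal-width bins on [0, 1]; linspace gives nWidthBar + 1 edges so the last bin ends at 1.0
plt.hist([tJour["Success"], tJour["Fail"]], bins=np.linspace(0.0, 1.0, nWidthBar + 1), histtype='stepfilled', alpha=0.1, label=["Pari gagné", "Pari perdu"])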

[Image: capture hist ok]

As you can see, values are between 0.0 and 1.0. The two sets can start and end at different minimum and maximum values, so I had to pass the bins as an explicit list to keep them aligned. Important: the sets don’t contain the same number of values, so if they are normalized I can no longer see where and by how much one set exceeds the other.

I don’t really care about the values on the y axis; I want to keep the proportion between both sets, as said in #61. I wanted a better visualization with a KDE using distplot. I know a KDE is about density and has an area of 1 under the curve, but as I said, I don’t care about the values; I just need to keep the correct proportion between both sets. Here is the code. The range entry in hist_kws is there so that both sets keep the same bin width when the KDE is drawn.


ax = plt.subplot(nbLignes, 3, nInd+1)
sns.distplot(tJour["Success"], ax=ax, norm_hist=False, bins=20, hist = True, kde = True, hist_kws={'range':(0,1)}, kde_kws = {'shade': True, 'linewidth': 3, 'bw': 1/40}, label="Pari gagné")
sns.distplot(tJour["Fail"],    ax=ax, norm_hist=False, bins=20, hist = True, kde = True, hist_kws={'range':(0,1)}, kde_kws = {'shade': True, 'linewidth': 3, 'bw': 1/40}, label="Pari perdu")

[Image: capture failed normalization]

You can see the problem: each set is normalized without taking the other into account, so the blue set becomes as big as the orange one. The plot no longer shows that I have very few blue values.

I would have liked to be able to correct this by giving both sets to a single distplot call rather than calling distplot twice, or by adding something like norm_kde=False to keep the height of the KDE in line with the histogram. I worked around it by drawing on separate axes and changing the ylim of each KDE according to the area occupied by each set, since whatever the base area, a KDE has an area of 1.0.

nWidthBar = 20
# nWidthBar + 1 edges, so the bins cover the whole [0, 1] range
bins = np.linspace(0.0, 1.0, nWidthBar + 1)
arr, _, _ = plt.hist([tJour["Success"], tJour["Fail"]], bins=bins, histtype='stepfilled', alpha=0.1, label=["Pari gagné", "Pari perdu"])
# Area of each unnormalized histogram: total count times bin width (1/nWidthBar)
tSurfaces = [np.sum(tab) / nWidthBar for tab in arr]
ax1 = ax.twinx()
ylimMax = ax.get_ylim()[1]
ax1.set_ylim(top=ylimMax / tSurfaces[0])  # scale the KDE of set 0
ax1.yaxis.set_ticks([])
sns.distplot(tJour["Success"], ax=ax1, bins=nWidthBar, hist=False, kde=True, hist_kws={'range': (0, 1)}, kde_kws={'shade': True, 'linewidth': 3, 'bw': 1/40}, color='C0')
ax2 = ax.twinx()
ax2.set_ylim(top=ylimMax / tSurfaces[1])  # scale the KDE of set 1
ax2.yaxis.set_ticks([])
sns.distplot(tJour["Fail"], ax=ax2, bins=nWidthBar, hist=False, kde=True, hist_kws={'range': (0, 1)}, kde_kws={'shade': True, 'linewidth': 3, 'bw': 1/40}, color='C1')

[Image: final capture]
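
The axis rescaling in this workaround can be checked numerically. The following is a minimal sketch under stated assumptions: the `success` and `fail` arrays are synthetic stand-ins for the two sets (the beta distributions, sample sizes, and helper names are hypothetical, not taken from the original data). It shows that the area of an unnormalized histogram is its total count times the bin width, and that dividing the count-axis limit by that area gives a twin-axis limit at which a unit-area density lines up with the counts.

```python
import numpy as np

# Hypothetical stand-ins for tJour["Success"] and tJour["Fail"]:
# values in [0, 1] with very different sample sizes.
rng = np.random.default_rng(0)
success = rng.beta(5, 2, size=900)
fail = rng.beta(2, 5, size=100)

n_bins = 20
edges = np.linspace(0.0, 1.0, n_bins + 1)
bin_width = 1.0 / n_bins


def histogram_area(values):
    """Area under an unnormalized histogram: total count times bin width."""
    counts, _ = np.histogram(values, bins=edges)
    return counts.sum() * bin_width


def twin_axis_top(count_axis_top, area):
    """Upper limit for a density axis such that a density value d is drawn at
    the same height as the count value d * area on the count axis."""
    return count_axis_top / area


area_success = histogram_area(success)  # 900 * 0.05
area_fail = histogram_area(fail)        # 100 * 0.05

count_axis_top = 150.0  # arbitrary upper limit of the count axis (assumed)
density = 2.0           # arbitrary density value to check (assumed)

# Relative height of the density value on the rescaled twin axis ...
frac_on_twin = density / twin_axis_top(count_axis_top, area_success)
# ... matches the relative height of density * area on the count axis.
frac_on_counts = density * area_success / count_axis_top
```

Because the two areas differ by the same factor as the sample sizes, the rescaled curves keep the relative proportion between the sets, which is what per-set normalization loses.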

So what I mean is:

  • With multiple sets to plot, normalized or not, sometimes it’s important to keep the relative proportion to compare them.
  • I prefer to compare them with kde over histograms.
  • A lot of code has to be written, rather than just passing a list with both sets to distplot (as with matplotlib’s hist) and setting something like norm=False.

0 reactions
mwaskom commented, Jun 14, 2020

Closed with #2125
