
histplot common normalization ignores weights


When using weights in histplot for probability plots, it seems seaborn does not take the weights into account when common_norm is used.

See this small example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

print(data)

# 2x2 grid: rows = stat (probability, count); columns = common_norm (True, False)
f, axs = plt.subplots(figsize=(7, 5), nrows=2, ncols=2)

for i, com_norm in enumerate([True, False]):
    sns.histplot(
        data,
        x="price", hue="cut",
        stat='probability',
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[0][i],
    )
    axs[0][i].set_title(f"stat=probability, common_norm={com_norm}")

    sns.histplot(
        data,
        x="price", hue="cut",
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[1][i],
    )
    axs[1][i].set_title(f"stat=count, common_norm={com_norm}")
plt.show()

This yields the figure below (four panels titled "stat=probability/count, common_norm=True/False"; image not reproduced here).

I would have assumed that the weights are taken into account when calculating the probability. From the code (https://github.com/mwaskom/seaborn/blob/78e9c0800514b30b1e89a47db58e3f564ee51903/seaborn/distributions.py#L480) I can see that it only takes the length of the data set into account, which is what the plot reflects. I would have expected the top-left plot to show 'I' with a probability of 20%, i.e. 1/(1+2+2). The documentation for weights reads:

If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.

It does not mention anything about stats other than 'count'; wouldn't it make sense to take the weights into account for those as well?

In any case, being able to weight the probabilities under a common norm would be useful, if that is how histplot is intended to behave.
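
To make the discrepancy concrete, here is a minimal pandas sketch (plain pandas, not seaborn internals) contrasting the scaling seaborn currently applies with the weighted normalization I would expect:

import pandas as pd

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

# Scaling seaborn currently applies under common_norm: group length / total length
print(data.groupby('cut').size() / len(data))
# I: 0.333..., N: 0.666...

# Scaling a weighted common norm should apply: group weight sum / total weight sum
print(data.groupby('cut')['size'].sum() / data['size'].sum())
# I: 0.2, N: 0.8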

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
zerothi commented on Sep 9, 2021

The diff for fixing this is:

diff --git a/seaborn/distributions.py b/seaborn/distributions.py
index 5f63289..8329807 100644
--- a/seaborn/distributions.py
+++ b/seaborn/distributions.py
@@ -424,6 +424,12 @@ class _DistributionPlotter(VectorPlotter):
                 warn_singular=False,
             )
 
+        sum_weight = 0.
+        if common_norm:
+            for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
+                if "weights" in self.variables:
+                    sum_weight += sub_data["weights"].sum()
+
         # First pass through the data to compute the histograms
         for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
 
@@ -464,12 +470,21 @@ class _DistributionPlotter(VectorPlotter):
             hist = pd.Series(heights, index=index, name="heights")
 
             # Apply scaling to normalize across groups
-            if common_norm:
+            if common_norm and weights is None:
                 hist *= len(sub_data) / len(all_data)
+            elif common_norm:
+                hist *= weights.sum() / sum_weight
 
             # Store the finalized histogram data for future plotting
             histograms[key] = hist

It can be altered as needed.
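
Until a fix like this lands, one user-side workaround (just a sketch, not an official seaborn feature) is to pre-normalize the weights so they sum to 1 across all groups and plot with stat='count'; the stacked bar heights are then already weighted common-norm probabilities:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

# 'w_norm' is an illustrative column: weights rescaled to sum to 1 overall,
# so summing them per bin (stat='count') yields I: 0.2 and N: 0.8.
data['w_norm'] = data['size'] / data['size'].sum()

ax = sns.histplot(
    data,
    x="price", hue="cut",
    stat='count',
    multiple="stack",
    weights='w_norm',
    discrete=True,
)
ax.set_ylabel("probability (weighted common norm)")
plt.show()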

0 reactions
zerothi commented on Sep 6, 2021

> I think it would be cleaner to have default weights of 1, then you can just proceed from there by summing the weights for each group to compute the relevant numerator/denominator of the scaling factor, and you don't need to repeat conditionals in multiple places.

I don’t have any preference, this was my quick hack 😉 On the other hand, summing the weights for very large data sets where it isn’t needed might be annoying/slow?
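
For illustration, here is a standalone sketch of that "default weights of 1" idea (hypothetical helper code, not seaborn's actual internals):

import pandas as pd

df = pd.DataFrame({'cut': ['I', 'N', 'N'], 'size': [1., 2., 2.]})
weights_col = 'size'  # or None when the user passed no weights

# Default the weights to 1 once, up front...
w = df[weights_col] if weights_col else pd.Series(1.0, index=df.index)

# ...then a single formula covers both cases: with unit weights,
# the subset weight sum over the total reduces to len(sub) / len(all).
for cut, sub in df.groupby('cut'):
    print(cut, w.loc[sub.index].sum() / w.sum())  # I: 0.2, N: 0.8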


