histplot common normalization ignores weights
When using weights in histplot for probability plots, it seems seaborn does not take the weights into account when using common_norm. See this small example:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})
print(data)

f, axs = plt.subplots(figsize=(7, 5), nrows=2, ncols=2)
for i, com_norm in enumerate([True, False]):
    sns.histplot(
        data,
        x="price", hue="cut",
        stat='probability',
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[0][i],
    )
    axs[0][i].set_title(f"stat=probability, common_norm={com_norm}")
    sns.histplot(
        data,
        x="price", hue="cut",
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[1][i],
    )
    axs[1][i].set_title(f"stat=count, common_norm={com_norm}")
plt.show()
This yields a 2x2 grid of plots: stat=probability (top row) and stat=count (bottom row), with common_norm=True (left column) and common_norm=False (right column). (Figure not shown here.)
I would have assumed that the weights are taken into account when calculating the probability. From the code (https://github.com/mwaskom/seaborn/blob/78e9c0800514b30b1e89a47db58e3f564ee51903/seaborn/distributions.py#L480) I can see that it only uses the length of the data set, which is what the plot reflects. I would have expected the top-left plot to show that 'I' has a probability of 20%, i.e. 1/(1+2+2). The documentation for weights says:
If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.
It does not mention any stat other than 'count', but wouldn't it make sense to take the weights into account for the other stats as well? In any case, even if histplot is behaving as intended, being able to weight the probability by the weights when using a common norm would be useful.
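To make the expectation concrete, here is a minimal sketch (plain pandas, not seaborn internals) of the probabilities I would expect with common normalization and weights:

import pandas as pd

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

# Weighted probability per hue level, normalized by the total weight across all groups.
expected = data.groupby('cut')['size'].sum() / data['size'].sum()
print(expected)  # I -> 0.2, N -> 0.8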
Top GitHub Comments
The diff for fixing this is:
It can be altered as needed.
I don't have any preference; this was my quick hack 😉 On the other hand, summing the weights for very large data sets where it isn't needed might be annoying/slow?
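The actual diff is not reproduced above; as a hedged standalone sketch of the idea being discussed (my reading, not seaborn's real implementation): when weights are supplied, normalize by their sum instead of the number of observations, and skip the extra sum in the unweighted case to avoid the cost on very large data sets.

import numpy as np

# Hypothetical helper illustrating the normalization choice; the name and signature
# are made up for illustration and do not come from seaborn.
def common_norm_denominator(n_observations, weights=None):
    if weights is None:
        # Unweighted case: normalize by the number of observations and
        # avoid an unnecessary sum over a large array.
        return n_observations
    # Weighted case: normalize by the total weight so each observation
    # contributes to the probability in proportion to its weight.
    return np.asarray(weights).sum()

print(common_norm_denominator(3))                        # 3
print(common_norm_denominator(3, weights=[1., 2., 2.]))  # 5.0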