histplot common normalization ignores weights
When using weights in histplot for probability plots, it seems seaborn does not take the weights into account when using common_norm. See this small example:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})
print(data)

f, axs = plt.subplots(figsize=(7, 5), nrows=2, ncols=2)
for i, com_norm in enumerate([True, False]):
    sns.histplot(
        data,
        x="price", hue="cut",
        stat='probability',
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[0][i],
    )
    axs[0][i].set_title(f"stat=probability, common_norm={com_norm}")
    sns.histplot(
        data,
        x="price", hue="cut",
        multiple="stack",
        weights='size',
        common_norm=com_norm,
        discrete=True,
        ax=axs[1][i],
    )
    axs[1][i].set_title(f"stat=count, common_norm={com_norm}")
plt.show()
This yields a 2x2 grid of plots: stat=probability (top row) and stat=count (bottom row), with common_norm=True (left column) and common_norm=False (right column). (Figure not shown here.)
I would have assumed that the weights are taken into account when calculating the probability. From the code (https://github.com/mwaskom/seaborn/blob/78e9c0800514b30b1e89a47db58e3f564ee51903/seaborn/distributions.py#L480) I can see that it only uses the length of the data set, which is what the plot reflects. I would have expected the top-left plot to show that 'I' has a probability of 20%, i.e. 1/(1+2+2). The documentation for weights says:
If provided, weight the contribution of the corresponding data points towards the count in each bin by these factors.
It does not mention any stat other than 'count', but wouldn't it make sense to take the weights into account for the other stats as well? In any case, even if histplot is behaving as intended, being able to weight the probability by the weights when using a common norm would be useful.
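To make the expectation concrete, here is a minimal sketch (plain pandas, not seaborn internals) of the probabilities I would expect with common normalization and weights:

import pandas as pd

data = pd.DataFrame({
    'size': [1., 2., 2.],
    'cut': ['I', 'N', 'N'],
    'price': [100, 100, 100],
})

# Weighted probability per hue level, normalized by the total weight across all groups.
expected = data.groupby('cut')['size'].sum() / data['size'].sum()
print(expected)  # I -> 0.2, N -> 0.8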
Top GitHub Comments
The diff for fixing this is:
It can be altered as needed.
I don't have any preference; this was my quick hack 😉 On the other hand, summing the weights for very large data sets where it isn't needed might be annoying/slow?
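The actual diff is not reproduced above; as a hedged standalone sketch of the idea being discussed (my reading, not seaborn's real implementation): when weights are supplied, normalize by their sum instead of the number of observations, and skip the extra sum in the unweighted case to avoid the cost on very large data sets.

import numpy as np

# Hypothetical helper illustrating the normalization choice; the name and signature
# are made up for illustration and do not come from seaborn.
def common_norm_denominator(n_observations, weights=None):
    if weights is None:
        # Unweighted case: normalize by the number of observations and
        # avoid an unnecessary sum over a large array.
        return n_observations
    # Weighted case: normalize by the total weight so each observation
    # contributes to the probability in proportion to its weight.
    return np.asarray(weights).sum()

print(common_norm_denominator(3))                        # 3
print(common_norm_denominator(3, weights=[1., 2., 2.]))  # 5.0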