geom_density ignores "weight" argument

Hi, I noticed that when running pn.ggplot(df, pn.aes(x="x", weight="w")) + pn.geom_density() the weight is ignored. I am using plotnine version 0.6.0.

I validated the difference by running df.reindex(df.index.repeat(df["w"])) and plotting this without the weight argument.

Issue Analytics

Created 3 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

has2k1 commented, Apr 29, 2020

I do not think you can do that, because for a kernel density algorithm there are two ways to affect the contribution of any distinct value towards the final density.

1. Its frequency (i.e. addition)
2. Its weight (i.e. multiplication, which is a shortcut for addition)

For the stability of the algorithms, the weighting (multiplication) is normalised to the [0, 1] domain for any given density computation. That shuts out option 2, leaving you with option 1.
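To make the normalisation concrete, here is a minimal sketch of a weighted Gaussian kernel density estimate in plain NumPy. It is not plotnine's implementation: the function name `weighted_kde`, the fixed bandwidth, and the toy data are all assumptions made for illustration. It shows that, once the weights are normalised to sum to 1, weighting a point by 3 gives the same curve as repeating it 3 times, provided the bandwidth is held fixed.

```python
import numpy as np

def weighted_kde(x, grid, weights=None, bandwidth=0.5):
    """Gaussian KDE on `grid`; hypothetical helper, not plotnine's code."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if weights is None:
        weights = np.ones_like(x)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalisation: every weight now lies in [0, 1]
    z = (grid[:, None] - x[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels @ w  # weighted sum of one kernel per data point

x = np.array([-1.0, 0.0, 2.0])
grid = np.linspace(-5, 6, 200)

# Weight: the point at 2.0 contributes three times as much as the others.
by_weight = weighted_kde(x, grid, weights=[1, 1, 3])

# Frequency: repeat the point at 2.0 three times instead of weighting it.
by_frequency = weighted_kde(np.repeat(x, [1, 1, 3]), grid)

print(np.allclose(by_weight, by_frequency))  # True
```

In practice the bandwidth is usually chosen from the data, so replicating rows can shift it slightly; the equality above holds exactly only because the bandwidth is fixed here.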

So maybe you can make it easier by creating a helper function using something like

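import numpy as np

def weight_to_frequency(df, wt, precision=3):
    # wt: array-like of non-negative weights, one per row of df
    wt = np.asarray(wt, dtype=float)
    ns = np.round(wt / wt.sum() * 10**precision).astype(int)  # no. of times to replicate each row
    idx = np.repeat(df.index, ns)                # selection indices (assumes a unique index)
    return df.loc[idx].reset_index(drop=True)    # replication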

to come up with integer replication factors.
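A quick check of this helper on toy data might look like the following. The data frame, its column names, and the weights are made up for illustration, and the function is repeated (with the weights coerced to a float array) so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def weight_to_frequency(df, wt, precision=3):
    wt = np.asarray(wt, dtype=float)
    # Scale the normalised weights to integers: the replication factors.
    ns = np.round(wt / wt.sum() * 10**precision).astype(int)
    idx = np.repeat(df.index, ns)  # each index label repeated ns times
    return df.loc[idx].reset_index(drop=True)

# Hypothetical toy data with non-integer weights.
df = pd.DataFrame({"x": [0.1, 0.4, 0.9], "w": [0.5, 1.5, 3.0]})

expanded = weight_to_frequency(df, df["w"], precision=2)

# The normalised weights are 0.1, 0.3 and 0.6, so with precision=2 the
# rows are repeated 10, 30 and 60 times respectively.
print(expanded["x"].value_counts().sort_index().tolist())  # [10, 30, 60]
```

The expanded frame can then be passed to pn.ggplot and plotted with pn.geom_density() without a weight aesthetic. A larger precision tracks the weights more closely at the cost of a larger frame.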

pkhokhlov commented, Apr 29, 2020

I encountered this issue as well. Please see the example below:

import pandas as pd
import plotnine as pn
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200,
                           n_features=1,
                           n_informative=1,
                           n_redundant=0,
                           n_clusters_per_class=1,
                           random_state=2)

df = pd.DataFrame({"x" : X.T[0], "y" : y})
df.y = df.y.astype("category")

df["wt"] = np.where(df["y"] == 1, 5, 1)

(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

This produces the following plot: [image: stacked_density1]

If we do:

(pn.ggplot(df, pn.aes("x", fill="y", weight="wt")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

(pn.ggplot(df, pn.aes("x", fill="y")) +
 pn.geom_density(pn.aes(weight="wt"), position="fill") +
 pn.theme_seaborn(style="whitegrid"))

we get the same plot. However, if we do:

df2 = df.reindex(df.index.repeat(df["wt"]))

(pn.ggplot(df2, pn.aes("x", "stat(count)", fill="y")) +
 pn.geom_density(position="fill") +
 pn.theme_seaborn(style="whitegrid"))

we get: [image: stacked_density2]
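One possible reading of why the plots differ, sketched in plain NumPy rather than with plotnine itself: in this example the weight is constant within each fill group (5 for y == 1, 1 for y == 0), so if the weights are normalised per density computation, as described in the comment above, they cancel within each group. Repeating rows, by contrast, changes the group sizes, which a count-style scaling (density times the number of rows) picks up. The `kde` helper, the fixed bandwidth, the synthetic groups, and the assumption that the count is the density multiplied by the number of rows are all illustrative.

```python
import numpy as np

def kde(x, grid, bandwidth=0.5):
    """Unweighted Gaussian KDE with a fixed bandwidth (a simplification)."""
    z = (grid[:, None] - x[None, :]) / bandwidth
    norm = len(x) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, size=100)  # stand-in for the y == 0 rows (wt = 1)
x1 = rng.normal(1.0, 1.0, size=100)   # stand-in for the y == 1 rows (wt = 5)
grid = np.linspace(-4.0, 4.0, 9)

d0, d1 = kde(x0, grid), kde(x1, grid)

# Share of group 1 when stacking the normalised densities. A weight that is
# constant within a group cancels in that group's normalisation, so per-group
# normalised weights would give these same shares.
share_density = d1 / (d0 + d1)

# Share of group 1 when stacking counts after repeating its rows 5 times:
# the density is unchanged (fixed bandwidth) but n grows from 100 to 500.
share_count = (500 * d1) / (100 * d0 + 500 * d1)

print(np.round(share_density, 2))
print(np.round(share_count, 2))
```

Under these assumptions the second set of shares is larger everywhere, which matches the direction of the change between the two plots.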