Scatter plot with colour_by and size_by variables

Problem description

Use case: Say we have a df with 4 columns: a, b, c, d. We want to make a scatter plot with x=a, y=b, color_by=c and size_by=d. If c is categorical, we get a discrete set of colours and a corresponding legend; otherwise we get a continuous colour scale. size_by sets the size of the marker.
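
A minimal sketch of the categorical-versus-continuous rule described above, in plain Python. The helper name `resolve_colors` and the five-entry palette are assumptions made for illustration; neither is part of pandas.

```python
from numbers import Number

# Hypothetical palette; any list of valid Matplotlib colour names would do.
PALETTE = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]


def resolve_colors(values):
    """Return (kind, colours, legend_mapping) for a colour_by column.

    Numeric data -> ("continuous", floats scaled to [0, 1], None).
    Anything else -> ("discrete", one colour name per row, {category: colour}).
    """
    values = list(values)
    numeric = values and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    if numeric:
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
        return "continuous", [(v - lo) / span for v in values], None
    # Categorical: one palette entry per distinct value, in order of appearance
    categories = list(dict.fromkeys(values))
    mapping = {cat: PALETTE[i % len(PALETTE)] for i, cat in enumerate(categories)}
    return "discrete", [mapping[v] for v in values], mapping


print(resolve_colors(["IL", "IN", "IL", "OH"])[1])
# ['tab:blue', 'tab:orange', 'tab:blue', 'tab:green']
```

The discrete branch also returns the category-to-colour mapping, which is what a legend would be built from; the continuous branch returns values a colormap could consume.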

Such plots are needed often, as evidenced by questions on Stack Overflow.

Image below shows an example.

I wrote a blog post (hand-wavy at times, e.g. the marker size legend) on how to generate such a plot in Pandas. The code below shows how to make a similar plot.

Code Sample, a copy-pastable example if possible

import matplotlib.pyplot as plt
import pandas as pd
midwest = pd.read_csv("http://goo.gl/G1K41K")
# Filtering
midwest = midwest[midwest.poptotal < 50000]

fig, ax = plt.subplots()
groups = midwest.groupby('state')

# Tableau 20 Colors
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),  
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),  
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),  
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),  
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]
             

# Rescale to values between 0 and 1 
for i in range(len(tableau20)):  
    r, g, b = tableau20[i]  
    tableau20[i] = (r / 255., g / 255., b / 255.)

colors = tableau20[::2]

# Plotting each group 
for i, (name, group) in enumerate(groups):
    group.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)),
               s=10+group['popdensity']*0.1, # hand-wavy :(
               label=name, ax=ax, color=colors[i])

# Legend for State colours
lgd = ax.legend(numpoints=1, loc=1, borderpad=1, 
            frameon=True, framealpha=0.9, title="state")
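# `legendHandles` was renamed to `legend_handles` in Matplotlib 3.7
handles = getattr(lgd, "legend_handles", None) or lgd.legendHandles
for handle in handles: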
    handle.set_sizes([100.0])

# Make a legend for popdensity. Hand-wavy. Error prone!
pws = (pd.cut(midwest['popdensity'], bins=4, retbins=True)[1]).round(0)
for pw in pws:
    plt.scatter([], [], s=(pw**2)/2e4, c="k",label=str(pw))

h, l = plt.gca().get_legend_handles_labels()
plt.legend(h[5:], l[5:], labelspacing=1.2, title="popdensity", borderpad=1, 
            frameon=True, framealpha=0.9, loc=4, numpoints=1)

plt.gca().add_artist(lgd)

This produces the following plot: [image: download 1]

I was wondering whether the use case is important enough to introduce changes to the scatter plot API, so that color_by and size_by arguments can be passed. I understand that the same set of arguments is used across different plots, and that size_by will not make sense for many of them.

If this does not make it into the API, it would still be useful to have a detailed example in the cookbook, or a function that works out of the box for such plots.

Issue Analytics

Created 6 years ago
Reactions: 2
Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions

VincentAntoine commented, Sep 16, 2017

Hey!

I’ve made progress with the sizes, haven’t looked at colors yet. Taking the same data as @nipunbatra in his example above, this is what I have now:

import matplotlib.pyplot as plt
import pandas as pd

# fetching and filtering data
midwest = pd.read_csv("http://goo.gl/G1K41K")
midwest = midwest[midwest['poptotal'] < 50000]

# plotting
midwest.plot(kind='scatter', x='area', y='poptotal', s='popdensity',
             title='Population vs area and density')
plt.show()

[image: figure_1]

And if you want to make the bubbles smaller or bigger, you can use s_grow (default 1) to change that:

midwest.plot(kind='scatter', x='area', y='poptotal', s='popdensity',
             title='Population vs area and density', s_grow=0.2)
plt.show()

[image: figure_2]

Here is what I did so far:

1: grab the data, normalize the data appropriately to get reasonable bubble sizes, pass that to matplotlib to make the bubble plot
2: make the legend

Grabbing & normalizing data

Compared to what I explained in my previous post, I only slightly modified the __init__ method of the ScatterPlot class: s_grow, size_title, size_data_max and bubble_points (the default maximum bubble size, 200 points) are now attributes of ScatterPlot instances. That makes these 4 parameters easily accessible to the other methods when building the legend for the bubble sizes.

class ScatterPlot(PlanePlot):
    _kind = 'scatter'

    def __init__(self, data, x, y, s=None, s_grow=1, c=None, **kwargs):
        if s is None:
            # Set default size if no argument is given
            s = 20
        elif is_hashable(s) and s in data.columns:
            # Handle the case where s is a label of a column of the df
            # The data is normalized to 200 * s_grow
            size_data = data.loc[:, s].values
            if is_numeric_dtype(size_data):
                self.size_title = s
                self.size_data_max = size_data.max()
                self.s_grow = s_grow
                self.bubble_points = 200
                s = self.bubble_points * s_grow * size_data / self.size_data_max
            else:
                raise TypeError('s must be of numeric dtype')
        super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)
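
The normalization step can be tried in isolation. The sketch below pulls the scaling formula out of the class into a standalone function; the name `normalize_sizes` and the guard against a non-positive maximum are assumptions added for illustration and are not in the patch above.

```python
def normalize_sizes(size_data, s_grow=1.0, bubble_points=200.0):
    """Scale raw values so the largest one maps to bubble_points * s_grow."""
    size_data = [float(v) for v in size_data]
    data_max = max(size_data)
    if data_max <= 0:
        # Assumption: the formula only makes sense for a positive maximum
        raise ValueError("size data must have a positive maximum")
    return [bubble_points * s_grow * v / data_max for v in size_data]


print(normalize_sizes([50, 100, 200]))  # [50.0, 100.0, 200.0]
```

Because every value is divided by the column maximum, the biggest bubble always ends up at bubble_points * s_grow regardless of the units of the data.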

Building the legend

Before actually building the legend, we must define the sizes and labels of the bubbles to include in it. For instance, if we want 4 bubbles in our legend, a straightforward approach is to use data_max, 0.75 * data_max, 0.5 * data_max and 0.25 * data_max. However, as you can see in the graph built by @nipunbatra, this leads to values like 82, 733, 1382…, which are not as nice as labels with “round” values like in the graph produced by Altair (see @nipunbatra's blog post).

I have therefore tried to build a legend with round values. To make a legend with 4 bubbles, we need to define 4 bubble sizes and the 4 corresponding labels, with “round” values for the labels, the biggest of which is close to the maximum of the data.

For this I first need a helper function to extract the mantissa (or coefficient) and exponent of a number in decimal base.

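#class ScatterPlot(PlanePlot):
    # requires `import re` at module level
    def _sci_notation(self, num):
        # Format as e.g. '7.824899e+05', then read the parts back with regexes
        scientific_notation = '{:e}'.format(num)
        expnt = float(re.search(r'e([+-]\d*)$', scientific_notation).groups()[0])
        # Mantissa is truncated (not rounded) to one decimal
        coef = float(re.search(r'^([+-]?\d\.\d)', scientific_notation).groups()[0])
        return coef, expnt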

Example: _sci_notation(782489.89247823) returns (7.8, 5.0)
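
To check the example above outside the class, here is a standalone copy of the helper (the free-function name `sci_notation` is an assumption for illustration). Note that the regex truncates the mantissa to one decimal rather than rounding it.

```python
import re


def sci_notation(num):
    """Return (mantissa truncated to one decimal, exponent) in base 10."""
    text = "{:e}".format(num)  # e.g. '7.824899e+05'
    expnt = float(re.search(r"e([+-]\d*)$", text).group(1))
    coef = float(re.search(r"^([+-]?\d\.\d)", text).group(1))
    return coef, expnt


print(sci_notation(782489.89247823))  # (7.8, 5.0)
print(sci_notation(0.00123))          # (1.2, -3.0)
```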

Then, given a data_max, s_grow and bubble_points, this function finds 4 appropriate sizes and labels for the legend:

#class ScatterPlot(PlanePlot):
    def _legend_bubbles(self, data_max, s_grow, bubble_points):
        coef, expnt = self._sci_notation(data_max)
        labels_catalog = {
            (9, 10) : [10, 5, 2.5, 1],
            (7, 9) : [8, 4, 2, 0.5],
            (5.5, 7) : [6, 3, 1.5, 0.5],
            (4.5, 5.5) : [5, 2, 1, 0.2],
            (3.5, 4.5) : [4, 2, 1, 0.2],
            (2.5, 3.5) : [3, 1, 0.5, 0.2],
            (1.5, 2.5) : [2, 1, 0.5, 0.2],
            (0, 1.5) : [1, 0.5, 0.25, 0.1]
        }
        for lower_bound, upper_bound in labels_catalog:
            # Pick the catalog entry whose half-open range contains the mantissa
            if lower_bound <= coef < upper_bound:
                labels = np.array(labels_catalog[lower_bound, upper_bound]) * 10**expnt
                sizes = list(bubble_points * s_grow * labels / data_max)
                labels = ['{:g}'.format(l) for l in labels]
                return (sizes, labels)

Example: _legend_bubbles(data_max = 2678.0588199999, s_grow = 1, bubble_points = 200) returns: ([224.04287595147829, 74.680958650492769, 37.340479325246385, 14.936191730098553], ['3000', '1000', '500', '200'])

The first list gives the 4 bubble sizes (the values passed to matplotlib's s argument, which is the marker area in points squared) and the second list gives the 4 corresponding labels.

In our example with population density, the maximum of popdensity is 2678.0588199999. So what happens is:

we compute mantissa (2.6) and exponent (3.0)
2.6 lies between 2.5 and 3.5, so in the labels_catalog we pick [3, 1, 0.5, 0.2]
we compute the labels which are 3e3, 1e3, 0.5e3 and 0.2e3
we compute bubble sizes corresponding to these labels, that is bubble_points * s_grow * 3e3 / 2678.0588199999 etc
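
The four steps above can be reproduced with a self-contained sketch. It uses plain Python lists instead of NumPy and string slicing instead of regexes, so it is an approximation of the method above rather than the patch itself; the function name `legend_bubbles` is hypothetical and a positive data_max is assumed.

```python
def legend_bubbles(data_max, s_grow=1, bubble_points=200):
    """Return (sizes, labels) for a 4-bubble legend with round labels."""
    # Step 1: mantissa (truncated to one decimal) and exponent
    mantissa, exponent = "{:e}".format(data_max).split("e")
    coef, expnt = float(mantissa[:3]), int(exponent)
    catalog = {
        (9, 10): [10, 5, 2.5, 1],
        (7, 9): [8, 4, 2, 0.5],
        (5.5, 7): [6, 3, 1.5, 0.5],
        (4.5, 5.5): [5, 2, 1, 0.2],
        (3.5, 4.5): [4, 2, 1, 0.2],
        (2.5, 3.5): [3, 1, 0.5, 0.2],
        (1.5, 2.5): [2, 1, 0.5, 0.2],
        (0, 1.5): [1, 0.5, 0.25, 0.1],
    }
    for (low, high), steps in catalog.items():
        # Step 2: find the range that contains the mantissa
        if low <= coef < high:
            # Step 3: round label values, e.g. 3e3, 1e3, 0.5e3, 0.2e3
            values = [step * 10 ** expnt for step in steps]
            # Step 4: same scaling as the plotted bubbles
            sizes = [bubble_points * s_grow * v / data_max for v in values]
            return sizes, ["{:g}".format(v) for v in values]
    raise ValueError("data_max must be positive")


sizes, labels = legend_bubbles(2678.0588199999)
print(labels)  # ['3000', '1000', '500', '200']
```

The largest legend bubble (label 3000) comes out slightly bigger than the largest data bubble, since 3000 is above the data maximum.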

Finally, we put all the pieces together in a _make_legend method which is specific to the ScatterPlot class. After building the legend for the bubbles, we call the _make_legend method of the parent.

#class ScatterPlot(PlanePlot):
    def _make_legend(self):
        if hasattr(self, "size_title"):
            ax = self.axes[0]
            size_title = self.size_title
            data_max = self.size_data_max
            s_grow = self.s_grow
            bubble_points = self.bubble_points
            import matplotlib.legend as legend
            sizes, labels = self._legend_bubbles(data_max, s_grow, bubble_points)
            bubbles = []
            for size in sizes:
                bubbles.append(ax.scatter([], [], s=size, color='white', edgecolor='gray'))
            bubble_legend = legend.Legend(ax, handles=bubbles, labels=labels, loc='lower right')
            bubble_legend.set_title(size_title)
            ax.add_artist(bubble_legend)
        super()._make_legend()

I also have a few questions:

my helper function to grab mantissa and exponent of a number should probably not live in the ScatterPlot class, but I don’t really know where to put it. Any idea?
if we use large values for s_grow, the bubbles in the legend become so big that they overlap and/or hide the labels and legend title. We can make the legend layout somewhat adaptive with respect to s_grow by scaling labelspacing, borderpad and handletextpad proportionally to sqrt(s_grow), but the result is not always very good. A more flexible approach would be to put the legend in a separate subplot. We could then place the legend outside of the main plot axes, like in the graph produced by Altair, which gives many more layout options. Is this an approach you would like me to try, or do you prefer keeping it as it is?
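
A rough sketch of the sqrt(s_grow) scaling mentioned in the second question. The base values are Matplotlib's default legend paddings; the helper name `scaled_legend_kwargs` and the choice to never go below the defaults are assumptions for illustration.

```python
import math

# Matplotlib's default rcParams for legend.labelspacing, legend.borderpad
# and legend.handletextpad
BASE_PADDING = {"labelspacing": 0.5, "borderpad": 0.4, "handletextpad": 0.8}


def scaled_legend_kwargs(s_grow, base=BASE_PADDING):
    """Scale legend paddings with sqrt(s_grow).

    The marker size is an area, so the bubble diameter (which is what
    collides with labels) grows with the square root of s_grow.
    """
    factor = math.sqrt(max(s_grow, 1.0))  # assumption: never shrink the defaults
    return {key: value * factor for key, value in base.items()}


print(scaled_legend_kwargs(4))
# {'labelspacing': 1.0, 'borderpad': 0.8, 'handletextpad': 1.6}
```

The resulting dict could be unpacked into the Legend constructor, for example legend.Legend(ax, handles=bubbles, labels=labels, loc='lower right', **scaled_legend_kwargs(s_grow)).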

How does this look to you?

Thanks! Vincent

0 reactions

sorenwacker commented, Feb 20, 2020

Is anyone still working on this? I miss this functionality. If the column contains strings, the method should use distinct colors, similar to what happens in Plotly plots. The same goes for shapes.