mosaicplot should handle crosstab and frequency-count DataFrames

I have this dataset, in frequency-count form, in speeddating.csv:

DecisionM,IntelligentM,freq
0, 5, 9
0, 6, 21
0, 7, 35
0, 8, 35
0, 9, 14
0, 10, 10
1, 5, 11
1, 6, 12
1, 7, 30
1, 8, 48
1, 9, 27
1, 10, 16

I want to make a mosaic plot of it. This should be easy, since the data is already counted for us. For reference, this is what I am looking for: [image: expected output]

But this does not work:

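import pandas
import statsmodels
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

D = pandas.read_csv("speeddating.csv")
print(D)

# Plot the two category columns; nothing here points mosaic() at "freq"
mosaic(D, ["IntelligentM", "DecisionM"])
plt.show()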

mosaic() ignores the freq column (I guess I never told it to look there, so fair enough), decides there is an equal count (i.e. 1) of each case, and I get this: [image: fail#1]
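To see why every tile comes out the same size: when handed a DataFrame, mosaic() appears to count rows per category combination rather than read any count column. The sketch below imitates that with a plain groupby. It is not the library's actual code, and the table is rebuilt inline only so the example is self-contained.

```python
import pandas as pd

# The frequency table from above, built inline so the example runs on its own.
D = pd.DataFrame({
    "DecisionM":    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "IntelligentM": [5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10],
    "freq":         [9, 21, 35, 35, 14, 10, 11, 12, 30, 48, 27, 16],
})

# Counting rows per (IntelligentM, DecisionM) pair: each pair occurs in exactly
# one row, so every cell gets a count of 1 and "freq" is never consulted.
rows_per_cell = D.groupby(["IntelligentM", "DecisionM"]).size()
print(rows_per_cell)

# What the plot should be based on instead: the recorded frequencies.
wanted = D.set_index(["IntelligentM", "DecisionM"])["freq"].sort_index()
print(wanted)
```

Comparing the two printed Series shows the gap between what a row count sees and what the freq column actually says.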

The docs for mosaic() don’t give any insight (and by the way, currently the “Using a DataFrame as source,” section isn’t being parsed as code by your Sphinx run: http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.mosaicplot.mosaic.html).

I tried reformatting the data into a dict:

G = {(I,d) : D.loc[(D.DecisionM == d) & (D.IntelligentM==I), "freq"].iloc[0] for d in [0,1] for I in range(5,10)}

mosaic(G)
plt.show()

But this is awkward to write and comes out wrong because dictionaries are unordered. (I see that I could fix that with the index parameter, but how to use it is unclear to me and it just makes the code above even more awkward.) [image: fail#2]
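A shorter way to build the same kind of dict, as a sketch: index the table by the two category columns and convert the freq column. Sorting the index first gives a deterministic key order on Python 3.7 and later, where plain dicts keep insertion order. Whether mosaic() preserves that order is an assumption that depends on the statsmodels version, so the plotting call is left as a comment.

```python
import pandas as pd

# The frequency table from above, built inline so the example runs on its own.
D = pd.DataFrame({
    "DecisionM":    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "IntelligentM": [5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10],
    "freq":         [9, 21, 35, 35, 14, 10, 11, 12, 30, 48, 27, 16],
})

# Index by the two category columns, sort so the key order is deterministic,
# then turn the "freq" column into {(IntelligentM, DecisionM): count}.
counts = D.set_index(["IntelligentM", "DecisionM"])["freq"].sort_index()
G = counts.to_dict()

print(list(G.items())[:4])

# Assumption: G could then be handed to the plot, e.g.
#   from statsmodels.graphics.mosaicplot import mosaic
#   mosaic(G)
```

Unlike the comprehension above, this picks up every row of the table, including the score of 10.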

The way I got that first (correct) image was computationally expensive: I rewrote the DataFrame in the form mosaic() wants and let it crunch it up again:

import itertools
flatten = lambda L: list(itertools.chain.from_iterable(L))

G2 = flatten([[(I,d)]*D.loc[(D.DecisionM == d) & (D.IntelligentM==I), "freq"].iloc[0] for d in [0,1] for I in range(5,10)])
G2 = pandas.DataFrame(G2)
G2.columns = ["IntelligentM", "DecisionM"]
print(G2)

mosaic(G2,["IntelligentM", "DecisionM"])
plt.show()

To be clear, G2 looks like:

     IntelligentM  DecisionM
0               5          0
1               5          0
2               5          0
3               5          0
4               5          0
5               5          0
6               5          0
7               5          0
8               5          0
9               6          0
10              6          0
11              6          0
12              6          0
13              6          0
14              6          0
15              6          0
16              6          0
17              6          0
18              6          0
19              6          0
20              6          0
21              6          0
22              6          0
23              6          0
24              6          0
25              6          0
26              6          0
27              6          0
28              6          0
29              6          0
..            ...        ...
212             8          1
213             8          1
214             8          1
215             9          1
216             9          1
217             9          1
218             9          1
219             9          1
220             9          1
221             9          1
222             9          1
223             9          1
224             9          1
225             9          1
226             9          1
227             9          1
228             9          1
229             9          1
230             9          1
231             9          1
232             9          1
233             9          1
234             9          1
235             9          1
236             9          1
237             9          1
238             9          1
239             9          1
240             9          1
241             9          1

[242 rows x 2 columns]

This is even more verbose than the last because mosaic() demands labelled columns. It’s also slow. I want to avoid having to do this, but it’s the only way I’ve found so far.
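If the expansion has to be done anyway, pandas can do it without the nested comprehension. This is a minimal sketch that repeats each row label freq times and selects those rows in one step. It still materialises one row per observation, so it shortens the workaround rather than removing it. It uses all twelve rows of the table, so the result is longer than the 242-row frame shown above, which was built with range(5,10).

```python
import pandas as pd

# The frequency table from above, built inline so the example runs on its own.
D = pd.DataFrame({
    "DecisionM":    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "IntelligentM": [5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10],
    "freq":         [9, 21, 35, 35, 14, 10, 11, 12, 30, 48, 27, 16],
})

# Repeat each row label "freq" times, then pull those rows out in one
# indexing step and renumber them from 0.
repeated_labels = D.index.repeat(D["freq"].values)
G2 = D.loc[repeated_labels, ["IntelligentM", "DecisionM"]].reset_index(drop=True)

print(G2.shape)
```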

Further, sometimes contingency data comes in crosstab form; in fact I can generate it from the above:

In [4]: pandas.crosstab(G2.IntelligentM, G2.DecisionM)
Out[4]: 
DecisionM      0   1
IntelligentM        
5              9  11
6             21  12
7             35  30
8             35  48
9             14  27
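The crosstab can also be produced straight from the frequency table, without going through the expanded frame. A sketch using DataFrame.pivot, which works here because each (IntelligentM, DecisionM) pair appears in only one row; it includes the score of 10, which the output above does not.

```python
import pandas as pd

# The frequency table from above, built inline so the example runs on its own.
D = pd.DataFrame({
    "DecisionM":    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "IntelligentM": [5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10],
    "freq":         [9, 21, 35, 35, 14, 10, 11, 12, 30, 48, 27, 16],
})

# Rows become IntelligentM levels, columns become DecisionM levels,
# and each cell holds the recorded frequency.
ct = D.pivot(index="IntelligentM", columns="DecisionM", values="freq")
print(ct)
```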

But mosaic() fails badly on this:

In [7]: mosaic(pandas.crosstab(G2.IntelligentM, G2.DecisionM),["IntelligentM","DecisionM"])
...
KeyError: "['IntelligentM' 'DecisionM'] not in index"
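The KeyError makes sense once you look at the crosstab's labels: its columns are the DecisionM levels (0 and 1), not the variable names, so selecting columns called "IntelligentM" and "DecisionM" cannot succeed. One possible workaround, sketched below, is to flatten the crosstab into a {(row level, column level): count} dict first. The crosstab is rebuilt inline from the values printed above, and passing the dict on to mosaic() is an assumption left as a comment.

```python
import pandas as pd

# A crosstab like the one above, built inline (values copied from the output).
ct = pd.DataFrame(
    {0: [9, 21, 35, 35, 14], 1: [11, 12, 30, 48, 27]},
    index=pd.Index([5, 6, 7, 8, 9], name="IntelligentM"),
)
ct.columns.name = "DecisionM"

# The column labels are the DecisionM levels, not the variable names.
print(list(ct.columns))

# Flatten to {(IntelligentM, DecisionM): count}, one entry per cell.
cells = {(i, d): int(ct.loc[i, d]) for i in ct.index for d in ct.columns}
print(cells)

# Assumption: the dict could then be plotted, e.g.
#   from statsmodels.graphics.mosaicplot import mosaic
#   mosaic(cells)
```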

Can mosaic() gain a freq= param that short-circuits [_normalize_dataframe()](https://github.com/statsmodels/statsmodels/blob/b235b7bff9890f4b6280f3c0f4a57c605f3b81e0/statsmodels/graphics/mosaicplot.py#L322), and something similar to handle crosstabs? Or maybe it would be better, if less elegant, to split it into mosaic_contingency(), mosaic_items() and mosaic_crosstab(): contingency mode would operate on dicts {(var1, var2): count}, item mode would operate on tables (var1, var2, count), and crosstab mode would operate on matrices [var1, var2]: count.

I’m on statsmodels 0.6.1 on Python 3.