mosaicplot should handle crosstab and frequency-count DataFrames

I have this dataset, in frequency-count form (speeddating.csv):

DecisionM,IntelligentM,freq
0, 5, 9
0, 6, 21
0, 7, 35
0, 8, 35
0, 9, 14
0, 10, 10
1, 5, 11
1, 6, 12
1, 7, 30
1, 8, 48 
1, 9, 27
1, 10, 16

I want to make a mosaic plot of it. This should be easy; the data is already counted for us. For reference, I am looking for this: [image: expected output]

But this does not work:

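import pandas
import statsmodels
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

D = pandas.read_csv("speeddating.csv")
print(D)

# Plot the frequency table as-is; the freq column is not consulted
mosaic(D, ["IntelligentM", "DecisionM"])
plt.show()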

mosaic() ignores the freq column (I guess I never told it to look there, so fair enough), decides there are equal amounts (i.e. 1) of each case, and I get this: [image: fail#1]

The docs for mosaic() don’t give any insight (and by the way, currently the “Using a DataFrame as source,” section isn’t being parsed as code by your Sphinx run: http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.mosaicplot.mosaic.html).

I tried reformatting the data into a dict keyed by (IntelligentM, DecisionM):

G = {(I,d) : D.loc[(D.DecisionM == d) & (D.IntelligentM==I), "freq"].iloc[0] for d in [0,1] for I in range(5,10)}

mosaic(G)
plt.show()

But this is awkward to write and comes out wrong because dictionaries are unordered. (I see that I could fix that with the index parameter, but how to use it is unclear to me, and it just makes the code above even more awkward.) The result: [image: fail#2]
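One way to sidestep both the awkward comprehension and the ordering problem is to let pandas build the keyed counts. Setting the two category columns as the index leaves a Series whose keys are (IntelligentM, DecisionM) tuples in a fixed, sorted order. This is a minimal sketch, assuming mosaic() accepts such a Series the same way it accepts a dict; the inline CSV stands in for speeddating.csv, and draw() is a hypothetical wrapper that is defined here but not called.

```python
import io
import pandas as pd

# Inline copy of the frequency-count data, standing in for speeddating.csv.
CSV = """DecisionM,IntelligentM,freq
0,5,9
0,6,21
0,7,35
0,8,35
0,9,14
0,10,10
1,5,11
1,6,12
1,7,30
1,8,48
1,9,27
1,10,16
"""
D = pd.read_csv(io.StringIO(CSV))

# Index by the two categorical variables and keep only the counts.
# sort_index() makes the key order deterministic.
counts = D.set_index(["IntelligentM", "DecisionM"])["freq"].sort_index()


def draw(counts):
    """Hypothetical wrapper: plot the keyed counts (assumes mosaic takes a Series)."""
    import matplotlib.pyplot as plt
    from statsmodels.graphics.mosaicplot import mosaic

    mosaic(counts)
    plt.show()
```

Calling draw(counts) would then plot the table without expanding it to one row per observation.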

The way I got that first (correct) image was computationally expensive: I rewrote the DataFrame in the form mosaic() wants and let it crunch it up again:

import itertools
flatten = lambda L: list(itertools.chain.from_iterable(L))

G2 = flatten([[(I,d)]*D.loc[(D.DecisionM == d) & (D.IntelligentM==I), "freq"].iloc[0] for d in [0,1] for I in range(5,10)])
G2 = pandas.DataFrame(G2)
G2.columns = ["IntelligentM", "DecisionM"]
print(G2)

mosaic(G2,["IntelligentM", "DecisionM"])
plt.show()

To be clear, G2 looks like:

     IntelligentM  DecisionM
0               5          0
1               5          0
2               5          0
3               5          0
4               5          0
5               5          0
6               5          0
7               5          0
8               5          0
9               6          0
10              6          0
11              6          0
12              6          0
13              6          0
14              6          0
15              6          0
16              6          0
17              6          0
18              6          0
19              6          0
20              6          0
21              6          0
22              6          0
23              6          0
24              6          0
25              6          0
26              6          0
27              6          0
28              6          0
29              6          0
..            ...        ...
212             8          1
213             8          1
214             8          1
215             9          1
216             9          1
217             9          1
218             9          1
219             9          1
220             9          1
221             9          1
222             9          1
223             9          1
224             9          1
225             9          1
226             9          1
227             9          1
228             9          1
229             9          1
230             9          1
231             9          1
232             9          1
233             9          1
234             9          1
235             9          1
236             9          1
237             9          1
238             9          1
239             9          1
240             9          1
241             9          1

[242 rows x 2 columns]

This is even more verbose than the last because mosaic() demands labelled columns. It’s also slow. I want to avoid having to do this, but it’s the only way I’ve found so far.
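If the one-row-per-observation form really is needed, the expansion does not require a nested comprehension. A minimal sketch, assuming the frequency table is already in a DataFrame D with the columns shown earlier: Index.repeat() duplicates each row label freq times, and .loc then selects the repeated rows. Unlike range(5,10), this keeps the rows for rating 10, so it yields 268 rows rather than 242.

```python
import pandas as pd

# The frequency-count table from the top of the issue, built inline.
D = pd.DataFrame({
    "DecisionM":    [0] * 6 + [1] * 6,
    "IntelligentM": list(range(5, 11)) * 2,
    "freq":         [9, 21, 35, 35, 14, 10, 11, 12, 30, 48, 27, 16],
})

# Repeat each row label `freq` times, select those rows, and drop the counts.
expanded = (
    D.loc[D.index.repeat(D["freq"]), ["IntelligentM", "DecisionM"]]
     .reset_index(drop=True)
)

print(len(expanded))  # 268
```

The result already has labelled columns, so it can be passed to mosaic() with the same index list as G2.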

Further, sometimes contingency data comes in crosstab form; in fact I can generate it from the above:

In [4]: pandas.crosstab(G2.IntelligentM, G2.DecisionM)
Out[4]: 
DecisionM      0   1
IntelligentM        
5              9  11
6             21  12
7             35  30
8             35  48
9             14  27

But mosaic() fails badly on this:

In [7]: mosaic(pandas.crosstab(G2.IntelligentM, G2.DecisionM),["IntelligentM","DecisionM"])
...
KeyError: "['IntelligentM' 'DecisionM'] not in index"
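The KeyError most likely arises because a crosstab has no columns called IntelligentM or DecisionM: one variable becomes the row index and the other's values become the column labels. The sketch below inspects that structure on a small made-up sample (not the issue's data), assuming mosaic() looks the index names up as DataFrame columns.

```python
import pandas as pd

# Small made-up sample in one-row-per-observation form.
obs = pd.DataFrame({
    "IntelligentM": [5, 5, 6, 6, 6, 7, 7],
    "DecisionM":    [0, 1, 0, 0, 1, 0, 1],
})
tab = pd.crosstab(obs["IntelligentM"], obs["DecisionM"])

# The column labels are the DecisionM values, not variable names.
print(tab.columns.tolist())           # [0, 1]
print(tab.columns.name)               # DecisionM
print(tab.index.name)                 # IntelligentM
print("IntelligentM" in tab.columns)  # False
```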

Can mosaic() gain a freq= parameter that short-circuits [_normalize_dataframe()](https://github.com/statsmodels/statsmodels/blob/b235b7bff9890f4b6280f3c0f4a57c605f3b81e0/statsmodels/graphics/mosaicplot.py#L322), and something similar to handle crosstabs? Or maybe it would be better, if less elegant, to split it into mosaic_contingency(), mosaic_items() and mosaic_crosstab(): contingency mode would operate on dicts {(var1, var2): count}, item mode would operate on tables (var1, var2, count), and crosstab mode would operate on matrices [var1, var2]: count.
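To make the three proposed input shapes concrete, here is a rough sketch of a normalizing helper in user code. to_counts() and its index/freq parameters are hypothetical names for illustration only; nothing like it is part of statsmodels. It reduces each shape to a Series keyed by (var1, var2) tuples.

```python
import pandas as pd


def to_counts(data, index=None, freq=None):
    """Hypothetical helper: reduce three input shapes to tuple-keyed counts."""
    if isinstance(data, dict):
        # Contingency form: {(var1, var2): count}
        return pd.Series(data)
    if isinstance(data, pd.DataFrame) and freq is not None:
        # Item form: one row per (var1, var2) with a count column
        return data.set_index(list(index))[freq]
    if isinstance(data, pd.DataFrame):
        # Crosstab form: rows are var1, columns are var2
        return data.stack()
    raise TypeError("unsupported input shape")


# A four-row excerpt of the issue's data in each of the three shapes.
table = pd.DataFrame({
    "DecisionM":    [0, 0, 1, 1],
    "IntelligentM": [5, 6, 5, 6],
    "freq":         [9, 21, 11, 12],
})
as_dict = {(5, 0): 9, (6, 0): 21, (5, 1): 11, (6, 1): 12}
as_crosstab = table.pivot(index="IntelligentM", columns="DecisionM", values="freq")

from_table = to_counts(table, index=["IntelligentM", "DecisionM"], freq="freq")
from_dict = to_counts(as_dict)
from_crosstab = to_counts(as_crosstab)
```

All three results carry the same keys and counts, so a single plotting path could serve every shape.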

I’m on statsmodels 0.6.1 with Python 3.

Issue Analytics

  • State: open
  • Created 9 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

6 reactions
kshedden commented, Feb 17, 2015

If the input is a crosstab the easiest solution is to use “stack”:

tab = pd.crosstab(x, y)
mosaic(tab.stack())
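A self-contained version of that suggestion might look as follows. The x and y values are made up for illustration; stack() turns the crosstab into a Series whose keys are (row, column) tuples. The mosaic() call is left as a comment so the snippet runs without a plotting backend.

```python
import pandas as pd

# Made-up observations; x and y play the roles from the comment above.
x = pd.Series([5, 5, 6, 6, 6, 7, 7], name="IntelligentM")
y = pd.Series([0, 1, 0, 0, 1, 0, 1], name="DecisionM")

tab = pd.crosstab(x, y)

# One count per (IntelligentM, DecisionM) pair.
counts = tab.stack()
print(counts.loc[(6, 0)])  # 2

# from statsmodels.graphics.mosaicplot import mosaic
# mosaic(counts)
```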

0 reactions
emilebres commented, Feb 23, 2016

kshedden’s solution works for me.
