mosaicplot should handle crosstab and frequency-count DataFrames
I have this dataset, in frequency-count form:
speeddating.csv
DecisionM,IntelligentM,freq
0, 5, 9
0, 6, 21
0, 7, 35
0, 8, 35
0, 9, 14
0, 10, 10
1, 5, 11
1, 6, 12
1, 7, 30
1, 8, 48
1, 9, 27
1, 10, 16
I want to make a mosaic plot of it. This should be easy; the data is already counted for us. For reference, I am looking for this:
But this does not work:
import pandas
import statsmodels
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
D = pandas.read_csv("speeddating.csv")
print(D)
mosaic() ignores the freq column (I guess I never told it to look there, so fair enough), decides there are equal amounts (i.e. 1) of each case, and I see
The docs for mosaic() don’t give any insight (and by the way, currently the “Using a DataFrame as source” section isn’t being parsed as code by your Sphinx run: http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.mosaicplot.mosaic.html).
I tried reformatting the data into a dict keyed by category pairs:
G = {(I,d) : D.loc[(D.DecisionM == d) & (D.IntelligentM==I), "freq"].iloc[0] for d in [0,1] for I in range(5,10)}
mosaic(G)
plt.show()
But this is awkward to write and comes out wrong because dictionaries are unordered (I see that I could fix that with the index parameter, but how to use it is unclear to me and it just makes the code above even more awkward).
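One possibly less awkward way to build that dictionary is a single pass over the rows, inserting keys in the order the categories should appear. The sketch below uses only the standard library and the CSV text from the top of this issue. It assumes that mosaic() takes its category order from the order in which keys are first seen, which is not verified here for every statsmodels version. On Python 3.7 and later a plain dict also keeps insertion order; OrderedDict just makes the intent explicit. The function name counts_from_csv is made up for illustration.

```python
import csv
import io
from collections import OrderedDict

CSV_TEXT = """DecisionM,IntelligentM,freq
0, 5, 9
0, 6, 21
0, 7, 35
0, 8, 35
0, 9, 14
0, 10, 10
1, 5, 11
1, 6, 12
1, 7, 30
1, 8, 48
1, 9, 27
1, 10, 16
"""


def counts_from_csv(text):
    """Return an ordered {(IntelligentM, DecisionM): freq} mapping."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = [
        (int(r["IntelligentM"]), int(r["DecisionM"]), int(r["freq"]))
        for r in reader
    ]
    # Sort by IntelligentM, then DecisionM, so keys are inserted in a
    # predictable order (assumption: mosaic() follows first-seen order).
    rows.sort()
    return OrderedDict(((i, d), n) for i, d, n in rows)


G = counts_from_csv(CSV_TEXT)
# G can then be passed to mosaic(G); no per-cell .loc lookups are needed.
```

Unlike the range(5,10) comprehension above, this keeps every row of the CSV, including IntelligentM == 10.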
The way I got that first (correct) image was computationally expensive: I rewrote the DataFrame in the form mosaic() wants and let it crunch it up again:
import itertools
flatten = lambda L: list(itertools.chain.from_iterable(L))
G2 = flatten([[(I,d)]*D.loc[(D.DecisionM == d) & (D.IntelligentM==I), "freq"].iloc[0] for d in [0,1] for I in range(5,10)])
G2 = pandas.DataFrame(G2)
G2.columns = ["IntelligentM", "DecisionM"]
print(G2)
mosaic(G2,["IntelligentM", "DecisionM"])
plt.show()
To be clear, G2 looks like:
IntelligentM DecisionM
0 5 0
1 5 0
2 5 0
3 5 0
4 5 0
5 5 0
6 5 0
7 5 0
8 5 0
9 6 0
10 6 0
11 6 0
12 6 0
13 6 0
14 6 0
15 6 0
16 6 0
17 6 0
18 6 0
19 6 0
20 6 0
21 6 0
22 6 0
23 6 0
24 6 0
25 6 0
26 6 0
27 6 0
28 6 0
29 6 0
.. ... ...
212 8 1
213 8 1
214 8 1
215 9 1
216 9 1
217 9 1
218 9 1
219 9 1
220 9 1
221 9 1
222 9 1
223 9 1
224 9 1
225 9 1
226 9 1
227 9 1
228 9 1
229 9 1
230 9 1
231 9 1
232 9 1
233 9 1
234 9 1
235 9 1
236 9 1
237 9 1
238 9 1
239 9 1
240 9 1
241 9 1
[242 rows x 2 columns]
This is even more verbose than the last because mosaic() demands labelled columns. It’s also slow. I want to avoid having to do this, but it’s the only way I’ve found so far.
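If the one-row-per-observation form really is needed, the expansion can probably be done without the nested comprehension. A minimal sketch, assuming the frequency table is loaded as D with the columns shown at the top of this issue: repeat each row label freq times, then select rows by those labels.

```python
import io

import pandas

CSV_TEXT = """DecisionM,IntelligentM,freq
0, 5, 9
0, 6, 21
0, 7, 35
0, 8, 35
0, 9, 14
0, 10, 10
1, 5, 11
1, 6, 12
1, 7, 30
1, 8, 48
1, 9, 27
1, 10, 16
"""

# skipinitialspace strips the blank after each comma in the CSV.
D = pandas.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)

# Repeat each row label `freq` times, then pick those labels with .loc.
# The freq column is dropped because every row now stands for one case.
expanded = D.loc[
    D.index.repeat(D["freq"]), ["IntelligentM", "DecisionM"]
].reset_index(drop=True)

print(len(expanded))  # 268, the sum of the freq column
# expanded can be passed as mosaic(expanded, ["IntelligentM", "DecisionM"]).
```

This keeps all twelve rows of the CSV, so it yields 268 rows rather than the 242 shown above, which come from a loop that stops at IntelligentM == 9.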
Further, sometimes contingency data comes in crosstab form; in fact I can generate it from the above:
In [4]: pandas.crosstab(G2.IntelligentM, G2.DecisionM)
Out[4]:
DecisionM      0   1
IntelligentM
5              9  11
6             21  12
7             35  30
8             35  48
9             14  27
But mosaic() fails badly on this:
In [7]: mosaic(pandas.crosstab(G2.IntelligentM, G2.DecisionM),["IntelligentM","DecisionM"])
...
KeyError: "['IntelligentM' 'DecisionM'] not in index"
Can mosaic() gain a freq= param that short-circuits [_normalize_dataframe()](https://github.com/statsmodels/statsmodels/blob/b235b7bff9890f4b6280f3c0f4a57c605f3b81e0/statsmodels/graphics/mosaicplot.py#L322) and something similar to handle crosstabs? Or maybe it would be better, if less elegant, to split it into mosaic_contingency(), mosaic_items(), and mosaic_crosstab(): contingency mode would operate on dicts {(var1, var2): count}, item mode would operate on tables (var1, var2, count), and crosstab mode would operate on matrices [var1, var2]: count.
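To make the three shapes concrete, here is a small sketch of the same counts as an item table, a dict keyed by tuples, and a crosstab-style matrix, with pandas calls that convert between them. It only illustrates the data layouts on four cells of the dataset; the mosaic_*() functions proposed above are hypothetical and are not called.

```python
import pandas

# Item form: one row per (var1, var2, count).
items = pandas.DataFrame(
    {
        "IntelligentM": [5, 5, 6, 6],
        "DecisionM": [0, 1, 0, 1],
        "freq": [9, 11, 21, 12],
    }
)

# A Series indexed by (var1, var2) is a convenient pivot point
# between the three forms.
series = items.set_index(["IntelligentM", "DecisionM"])["freq"]

# Contingency form: {(var1, var2): count}.
contingency = series.to_dict()

# Crosstab form: var1 down the rows, var2 across the columns.
crosstab = series.unstack("DecisionM")

# And back from the crosstab to the item form.
round_trip = crosstab.stack().rename("freq").reset_index()
```

Since all three carry the same information, a single entry point that normalizes to one of them may be enough, rather than three separate functions.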
I’m on statsmodels 0.6.1 on Python 3.
Issue Analytics
- Created 9 years ago
- Comments: 7 (3 by maintainers)
If the input is a crosstab, the easiest solution is to use “stack”:
tab = pd.crosstab(x, y)
mosaic(tab.stack())
kshedden’s solution works for me.
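For the frequency-count table from the original question the same idea seems to apply: index the counts by the two category columns and hand the resulting Series to mosaic(). This is a sketch, assuming mosaic() treats any Series with a two-level index the way it treats the stacked crosstab. The helper names counts_to_series and plot_counts are made up for illustration and are not part of statsmodels.

```python
import io

import pandas

CSV_TEXT = """DecisionM,IntelligentM,freq
0, 5, 9
0, 6, 21
0, 7, 35
0, 8, 35
0, 9, 14
0, 10, 10
1, 5, 11
1, 6, 12
1, 7, 30
1, 8, 48
1, 9, 27
1, 10, 16
"""


def counts_to_series(df, by, freq="freq"):
    """Hypothetical helper: index the count column by the category columns."""
    return df.set_index(list(by))[freq].sort_index()


def plot_counts(series):
    """Hypothetical helper: draw the mosaic from a count Series.

    Imports are local so the data step above runs without a display.
    Assumption: mosaic() accepts a Series whose index holds the categories.
    """
    import matplotlib.pyplot as plt
    from statsmodels.graphics.mosaicplot import mosaic

    mosaic(series)
    plt.show()


D = pandas.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)
counts = counts_to_series(D, ["IntelligentM", "DecisionM"])
```

Calling plot_counts(counts) should then draw the plot from the 12 counts directly, without expanding the table to one row per observation.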