Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

pd.crosstab, categorical data and missing instances

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1
c      0  0

Problem description

This is from the documentation:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

However, f is not included in the table while c is.

Please let me know if this is in fact a bug, then I will be glad to write give writing a patch a try.

Thanks a lot in advance!

Expected Output

col_0 d e f row_0 a 1 0 0 b 0 1 0 c 0 0 0

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.8.0-49-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.1 pytest: 2.8.7 pip: 8.1.1 setuptools: 20.7.0 Cython: None numpy: 1.12.1 scipy: 0.17.0 xarray: None IPython: None sphinx: None patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: None tables: 3.2.2 numexpr: 2.6.2 feather: None matplotlib: 1.5.1 openpyxl: 2.3.0 xlrd: 0.9.4 xlwt: 0.7.5 xlsxwriter: None lxml: 3.5.0 bs4: 4.4.1 html5lib: 0.999 sqlalchemy: None pymysql: None psycopg2: None jinja2: None s3fs: None pandas_gbq: None pandas_datareader: None

Issue Analytics

State:
Created 6 years ago
Comments:15 (10 by maintainers)

Top GitHub Comments

1reaction

jorisvandenbosschecommented, Sep 1, 2018

Could a solution to this problem be to change the default of dropna to None instead of True? So if dropna=None would then depend on the dtype: False for categorical, True for other dtypes.

0reactions

MarcoGorellicommented, Feb 22, 2021

I just tried this on master and got

>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0      
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

which seems correct and in accordance with the description given alongside the example in the docs.

The only part which strikes me as not correct is that the docs still read

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

So, if that line in the docs is changed to

When using dropna=False, any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

then can we close the issue? Changing the default type of dropna would be a breaking change, and I’m not sure it would be worth it