pd.crosstab, categorical data and missing instances
Code Sample, a copy-pastable example if possible
import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)
col_0  d  e
row_0
a      1  0
b      0  1
c      0  0
Problem description
This is from the documentation:
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.
However, f is not included in the table while c is.
Please let me know if this is in fact a bug; if so, I will be glad to give writing a patch a try.
Thanks a lot in advance!
Expected Output
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0
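For readers who need the expected table regardless of how `pd.crosstab` treats empty categories, here is a minimal sketch of two possible approaches. It assumes a pandas release newer than the 0.20.1 reported in this issue; the variable names `full` and `padded` are illustrative only. Passing `dropna=False` is the documented way to keep empty categories, while reindexing onto the declared categories does not depend on that behavior.

```python
import pandas as pd

foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

# Option 1: ask crosstab not to drop empty rows/columns.
# Whether unobserved categories appear depends on the pandas version.
full = pd.crosstab(foo, bar, dropna=False)

# Option 2: pad the default result onto the declared categories.
# Missing rows/columns are created and filled with a count of 0.
padded = pd.crosstab(foo, bar).reindex(
    index=list(foo.categories),
    columns=list(bar.categories),
    fill_value=0,
)

print(padded)
```

Option 2 is more verbose, but it makes the intended row and column labels explicit instead of relying on a default.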
Output of pd.show_versions()
pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Comments:15 (10 by maintainers)
Top GitHub Comments
Could a solution to this problem be to change the default of dropna to None instead of True? So if dropna=None, the behavior would then depend on the dtype: False for categorical, True for other dtypes.
I just tried this on master and got
which seems correct and in accordance with the description given alongside the example in the docs.
The only part which strikes me as not correct is that the docs still read
So, if that line in the docs is changed to
then can we close the issue?
Changing the default value of dropna would be a breaking change, and I’m not sure it would be worth it.
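The dropna=None proposal from the first comment can be tried in user code without touching the pandas default. The sketch below is an illustration only: `resolve_dropna` and `crosstab_dtype_aware` are hypothetical helper names, not part of pandas, and the rule (False when any input is categorical, True otherwise) is my reading of that comment.

```python
import pandas as pd


def resolve_dropna(*arrays):
    """Hypothetical rule: keep empty categories if any input is categorical."""
    has_categorical = any(
        isinstance(pd.Series(values).dtype, pd.CategoricalDtype)
        for values in arrays
    )
    return not has_categorical


def crosstab_dtype_aware(index, columns, dropna=None, **kwargs):
    """Hypothetical wrapper: dropna=None means 'decide from the dtypes'."""
    if dropna is None:
        dropna = resolve_dropna(index, columns)
    return pd.crosstab(index, columns, dropna=dropna, **kwargs)


foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

print(resolve_dropna(foo, bar))          # categorical inputs
print(resolve_dropna(['a', 'b'], ['d', 'e']))  # plain lists
print(crosstab_dtype_aware(foo, bar))
```

Keeping the rule in a wrapper sidesteps the breaking-change concern raised above, since callers opt in explicitly and `pd.crosstab` itself is unchanged.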