pd.crosstab, categorical data and missing instances

Code Sample, a copy-pastable example if possible

import pandas as pd
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
pd.crosstab(foo, bar)

col_0  d  e
row_0      
a      1  0
b      0  1
c      0  0

Problem description

This is from the documentation:

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

However, f is not included in the table while c is.

Please let me know if this is in fact a bug; if so, I will be glad to give writing a patch a try.

Thanks a lot in advance!

Expected Output

col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 2.8.7
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.12.1
scipy: 0.17.0
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, Sep 1, 2018

Could a solution to this problem be to change the default of dropna from True to None? The behaviour for dropna=None would then depend on the dtype: False for categorical, True for other dtypes.
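
To make the proposal concrete, here is a minimal sketch of such a dtype-dependent default. Note that `crosstab_auto_dropna` is a hypothetical helper written for illustration and is not part of pandas. It also assumes a pandas version in which `dropna=False` keeps unused categories, as reported for master later in this thread.

```python
import pandas as pd


def crosstab_auto_dropna(index, columns, dropna=None, **kwargs):
    """Hypothetical wrapper (not a pandas API): resolve dropna=None by dtype."""
    if dropna is None:
        # Assumption: an input counts as categorical if its dtype is CategoricalDtype.
        has_categorical = any(
            isinstance(getattr(x, "dtype", None), pd.CategoricalDtype)
            for x in (index, columns)
        )
        # Proposed rule: False for categorical inputs, True for other dtypes.
        dropna = not has_categorical
    return pd.crosstab(index, columns, dropna=dropna, **kwargs)


foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

# Categorical inputs resolve to dropna=False, so unused categories are kept.
cat_table = crosstab_auto_dropna(foo, bar)

# Plain object/string inputs resolve to dropna=True, matching today's default.
plain_table = crosstab_auto_dropna(pd.Series(["a", "b"]), pd.Series(["d", "e"]))
```

An explicit `dropna=True` or `dropna=False` passed by the caller is forwarded unchanged, so only the default would behave differently under this rule.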

0 reactions
MarcoGorelli commented, Feb 22, 2021

I just tried this on master and got

>>> import pandas as pd
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0      
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

which seems correct and in accordance with the description given alongside the example in the docs.

The only part which strikes me as not correct is that the docs still read

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

So, if that line in the docs is changed to

When using dropna=False, any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

then can we close the issue? Changing the default value of dropna would be a breaking change, and I’m not sure it would be worth it.
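
For code that should not depend on the `dropna` default or on which pandas version is installed, one possible workaround is to reindex the result against the full category lists. This is only a sketch of that idea, not something proposed in the thread, and the variable names are illustrative.

```python
import pandas as pd

foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

# Whatever rows and columns the default call keeps...
observed = pd.crosstab(foo, bar)

# ...reindex both axes to every declared category, filling new cells with 0.
full = observed.reindex(
    index=list(foo.categories),
    columns=list(bar.categories),
    fill_value=0,
)
```

Passing plain lists to `reindex` makes the declared categories the source of truth for both axes, so the table has the same shape whether or not `crosstab` dropped the empty categories.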
