Groupby inconsistency with categorical values
Code Sample
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
# this gives two rows with counts of one, as expected
df.iloc[:2].groupby('s').size()
df['s'] = df['s'].astype('category')
# this gives five rows: two with a count of one and three with a count of zero
df.iloc[:2].groupby('s').size()
Problem description
I’m using categorical columns to save memory and possibly improve performance when there are lots of strings, and I would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for categorical groupby, is it possible to at least provide a boolean parameter to groupby, like empty_groups? Or maybe an even simpler solution exists, but I couldn’t find it.
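For later readers, here is a minimal sketch of two ways to get only the non-empty groups. It assumes a newer pandas than the 0.20.2 reported in this issue: as far as I know, the `observed` keyword of groupby was added in 0.23, after this issue was filed, and its default has changed across releases, so it is passed explicitly here. The variable names (`subset`, `all_groups`, `observed_groups`, `trimmed`, `trimmed_groups`) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')
subset = df.iloc[:2]

# observed=False keeps every category, including those absent from the subset
all_groups = subset.groupby('s', observed=False).size()

# observed=True keeps only the categories that actually occur in the subset
observed_groups = subset.groupby('s', observed=True).size()

# Alternative that does not depend on `observed`:
# drop the unused categories from the column before grouping
trimmed = subset.assign(s=subset['s'].cat.remove_unused_categories())
trimmed_groups = trimmed.groupby('s', observed=False).size()
```

The second approach changes the column's set of categories, so it is probably only appropriate when the unused categories are not needed downstream.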
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel® Xeon® CPU E5-1650 v4 @ 3.60GHz
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 6 years ago
- Comments: 8 (6 by maintainers)
Top GitHub Comments
See the extensive discussion in https://github.com/pandas-dev/pandas/issues/8559.
Closing as a duplicate.
Yes, you understood me correctly… and yes, you’re right…