Groupby inconsistency with categorical values

Code Sample

import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])

# With a plain string column this gives two rows, each with a count of one, as expected
df.iloc[:2].groupby('s').size()

df['s'] = df['s'].astype('category')
# With a categorical column this gives five rows: two with a count of one, three with a count of zero
df.iloc[:2].groupby('s').size()

Problem description

I’m using categorical columns to save memory and possibly improve performance when there are lots of repeated strings, and I would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for categorical groupby, would it be possible to at least give groupby a boolean parameter such as empty_groups? Or maybe a simpler solution already exists, but I couldn’t find it.
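For readers landing here later, a minimal sketch of two ways to avoid the empty groups. It assumes a pandas release newer than the 0.20.2 reported in this issue: as far as I know, `groupby` gained an `observed` parameter in 0.23, and `Series.cat.remove_unused_categories()` drops categories that do not occur in the data. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')
subset = df.iloc[:2]

# observed=False reproduces the behavior from the issue:
# every category gets a group, including the three unused ones
all_sizes = subset.groupby('s', observed=False).size()

# observed=True keeps only the categories that actually appear in the data
observed_sizes = subset.groupby('s', observed=True).size()

# Alternative: drop the unused categories first, then group as usual
trimmed = subset.assign(s=subset['s'].cat.remove_unused_categories())
trimmed_sizes = trimmed.groupby('s', observed=False).size()

print(observed_sizes)
```

Passing `observed` explicitly seems safer than relying on the default, since the default may differ between pandas versions.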

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel® Xeon® CPU E5-1650 v4 @ 3.60GHz
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
jreback commented, Jul 20, 2017

See the extensive discussion in https://github.com/pandas-dev/pandas/issues/8559

Closing as a duplicate.

0 reactions
toobaz commented, Jul 20, 2017

"@toobaz if I understand you correctly, then it probably won’t work in general"

Yes, you understood me correctly… and yes, you’re right…
