Ambiguous behaviour when `transform` `groupby` with `NaN`s

Similar issues: #10923, #9697, #9941

Please consider the following data:

import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None

Now, the two lines of code below do the same thing: they output each group's average as the new row values. The first uses a string function name; the second uses a lambda function. The first works; the second doesn't.

In [41]: df.groupby('C')['B'].transform('mean')
Out[41]: 
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5     5.670891
6     5.335332
7     0.580197
8     5.670891
9     5.670891
10    1.628290
11    1.628290
12    5.670891
13    8.493416
14    5.670891
15    8.493416
16    5.335332
17    5.670891
18    5.670891
19    5.335332
Name: B, dtype: float64

In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
   3061 
   3062         result.name = self._selected_obj.name
-> 3063         result.index = self._selected_obj.index
   3064         return result
   3065 

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   3092         try:
   3093             object.__getattribute__(self, name)
-> 3094             return object.__setattr__(self, name, value)
   3095         except AttributeError:
   3096             pass

pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
    306         object.__setattr__(self, '_index', labels)
    307         if not fastpath:
--> 308             self._data.set_axis(axis, labels)
    309 
    310     def _set_subtyp(self, is_all_dates):

~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2834             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2835                              'new values have %d elements' %
-> 2836                              (old_len, new_len))
   2837 
   2838         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements

The first result, using 'mean', is what I expected. It seems strange to me that the same operation behaves in two different ways. Note: the second form, with the lambda function, used to work on pandas version 0.19.1.
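One way to get the expected result from an arbitrary callable, without relying on how a given pandas version handles `NaN` keys in `transform`, is to aggregate per group and map the per-group values back onto the grouping column. The following is only a minimal sketch: the data is made up for illustration (it is not the random data from the issue), and `transform_via_map` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Illustrative data (not the issue's random data): two rows have a missing key.
df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 5.0, 7.0, 9.0],
    "C": [np.nan, np.nan, 0.0, 0.0, 1.0, 1.0],
})


def transform_via_map(frame, key, value, func):
    """Hypothetical helper: broadcast a per-group aggregate back to every row.

    Rows whose key is NaN belong to no group and come back as NaN.
    """
    per_group = frame.groupby(key)[value].agg(func)  # NaN keys are dropped here
    return frame[key].map(per_group).rename(value)


result = transform_via_map(df, "C", "B", lambda x: x.mean())
print(result.tolist())
```

The result has the same length and index as `df`, with `NaN` in the rows whose key is missing, which matches the shape of the `transform('mean')` output shown above.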

I first posted this question on SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I started to think this might be a bug.

Thanks

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 6 years ago
Reactions:1
Comments:8 (5 by maintainers)

Top GitHub Comments

2 reactions

jsevo commented, Jun 26, 2018

Edit: This was because there were None values in the grouping column, `groups`. Filling them with dummy values first gets around the issue.

~~I encounter this when trying to fill missing values per group:~~

def most_common_in_group(g):
    # Return the most frequent value in the group, or a marker if the
    # group has no non-missing values at all.
    try:
        return g.value_counts().index[0]
    except IndexError:
        return 'all_missing'


df.groupby('groups')['sometime_missing_values'].transform(most_common_in_group)
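The workaround mentioned in the edit note (filling missing keys with a dummy value before grouping) could look roughly like the sketch below. The data is made up for illustration, the column names follow the snippet above, and the placeholder `"__missing__"` is an assumption: it must be a value that never occurs as a real group key.

```python
import numpy as np
import pandas as pd


def most_common_in_group(g):
    """Return the most frequent non-missing value, or a marker if none exist."""
    counts = g.value_counts()  # missing values are ignored by default
    return counts.index[0] if len(counts) else "all_missing"


# Illustrative data; two rows have no group key.
df = pd.DataFrame({
    "groups": ["a", "a", "a", None, None, "b"],
    "sometime_missing_values": ["x", "x", None, "y", None, None],
})

# Assumption: "__missing__" never appears as a genuine key in `groups`.
filled_key = df["groups"].fillna("__missing__")

# Every row now belongs to a group, so the result has one value per input row.
most_common = df.groupby(filled_key)["sometime_missing_values"].transform(
    most_common_in_group
)
filled = df["sometime_missing_values"].fillna(most_common)
print(filled.tolist())
```

Note that this treats all rows with a missing key as one group of their own. If those rows should stay missing instead, the result can be masked afterwards, for example with `most_common.where(df["groups"].notna())`.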

1 reaction

mroeschke commented, Jun 12, 2021

This looks to work on master now. Could use a test

In [17]: df = pd.DataFrame({'A':[1,np.nan],'B':[1,1]})

In [18]: df.groupby('A').transform(lambda x:x)
Out[18]:
   B
0  1
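A regression check along the lines suggested here might look like the sketch below. It is not the test that was added to pandas; the function name and the data are hypothetical. It also assumes a recent pandas release (to my understanding, 1.5 or later) in which `transform` keeps the original index and returns `NaN` for rows whose key is missing; older versions may raise or drop those rows instead.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def check_transform_with_nan_keys():
    """Hypothetical check: string and lambda transforms should agree."""
    df = pd.DataFrame({
        "B": [1.0, 2.0, 3.0, 5.0, 7.0, 9.0],
        "C": [np.nan, np.nan, 0.0, 0.0, 1.0, 1.0],
    })
    by_name = df.groupby("C")["B"].transform("mean")
    by_lambda = df.groupby("C")["B"].transform(lambda x: x.mean())

    # Both paths should give identical values, index, and name.
    tm.assert_series_equal(by_name, by_lambda)
    # The result keeps the input index, with NaN for the NaN-key rows.
    assert by_lambda.index.equals(df.index)
    assert by_lambda.isna().tolist() == [True, True, False, False, False, False]
    return by_lambda


checked = check_transform_with_nan_keys()
```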
