Ambiguous behaviour when `transform` `groupby` with `NaN`s
Similar issues: #10923, #9697, #9941
Please, consider the following data:
import numpy
import pandas
df = pandas.DataFrame({'A': numpy.random.rand(20),
                       'B': numpy.random.rand(20) * 10,
                       'C': numpy.random.randint(0, 5, 20)})
df.loc[:4, 'C'] = None
Now, the two lines of code below are meant to do the same thing: output each group's average as the new value of every row in that group. The first one passes the function name as a string, the second one passes a lambda function. The first one works; the second one doesn't.
In [41]: df.groupby('C')['B'].transform('mean')
Out[41]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 5.670891
6 5.335332
7 0.580197
8 5.670891
9 5.670891
10 1.628290
11 1.628290
12 5.670891
13 8.493416
14 5.670891
15 8.493416
16 5.335332
17 5.670891
18 5.670891
19 5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
3061
3062 result.name = self._selected_obj.name
-> 3063 result.index = self._selected_obj.index
3064 return result
3065
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
3092 try:
3093 object.__getattribute__(self, name)
-> 3094 return object.__setattr__(self, name, value)
3095 except AttributeError:
3096 pass
pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
306 object.__setattr__(self, '_index', labels)
307 if not fastpath:
--> 308 self._data.set_axis(axis, labels)
309
310 def _set_subtyp(self, is_all_dates):
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
2834 raise ValueError('Length mismatch: Expected axis has %d elements, '
2835 'new values have %d elements' %
-> 2836 (old_len, new_len))
2837
2838 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements
The first one, using 'mean', is what I was expecting. In any case, it seems strange to me that the same operation behaves in two different ways.
Note: The second one, with the lambda function, used to work in pandas 0.19.1.
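For reference, the result I was expecting can also be built without `transform`. This is only a sketch under my own assumptions (a fixed seed and a float `C` column, neither of which is in the data above): compute the per-group means, then map them back onto the grouping column, so that rows whose key is NaN get NaN.

```python
import numpy as np
import pandas as pd

# Seeded variant of the data above; the seed is an assumption for reproducibility.
rng = np.random.RandomState(0)
df = pd.DataFrame({'B': rng.rand(20) * 10,
                   'C': rng.randint(0, 5, 20).astype(float)})
df.loc[:4, 'C'] = np.nan

# groupby drops NaN keys by default, so group_means has one entry per real group.
group_means = df.groupby('C')['B'].mean()

# Map every row's key to its group mean; NaN keys have no match and stay NaN.
expected = df['C'].map(group_means).rename('B')
print(expected)
```

The output has the same 20-row index as `df`, which is the shape `transform` is supposed to return.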
I first posted this question to SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I started to suspect that this is a bug.
Thanks
commit: None python: 3.6.1.final.0 python-bits: 64 OS: Linux OS-release: 3.16.0-38-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 27.2.0 Cython: None numpy: 1.12.1 scipy: None xarray: None IPython: 6.1.0 sphinx: None patsy: None dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: None s3fs: None pandas_gbq: None pandas_datareader: None
Edit: This was because there were None values in the grouping column, groups. Filling them first with dummies gets around the issue.
I encounter this when trying to fill missing values per group:
This looks to work on master now. Could use a test.
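The workaround mentioned above (filling the missing group keys with dummies first) might look roughly like the sketch below. The sentinel value `-1`, the seed, and the final masking step are my own assumptions, not something taken from the thread; the sentinel must not collide with a real group key.

```python
import numpy as np
import pandas as pd

# Seeded variant of the data from the issue; the seed is an assumption.
rng = np.random.RandomState(0)
df = pd.DataFrame({'A': rng.rand(20),
                   'B': rng.rand(20) * 10,
                   'C': rng.randint(0, 5, 20).astype(float)})
df.loc[:4, 'C'] = np.nan

# Hypothetical dummy key; real keys here are 0..4, so -1 cannot clash.
SENTINEL = -1

# Fill the missing keys so that every row belongs to some group.
key = df['C'].fillna(SENTINEL)

# With no NaN keys left, the lambda version returns one value per original row.
result = df.groupby(key)['B'].transform(lambda x: x.mean())

# Optionally blank out the rows that had no real key, matching the 'mean' output.
result = result.where(df['C'].notna())
print(result)
```

Skipping the last step leaves the mean of the dummy group in the rows that originally had no key, which may or may not be what you want.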