Ambiguous behaviour when `transform` `groupby` with `NaN`s
Similar issues: #10923, #9697, #9941
Please, consider the following data:
import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
'B':numpy.random.rand(20)*10,
'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None
Now, the two lines of code below do the same thing: they output each group's mean as the new value of every row in that group. The first one uses a string function name, the second one a lambda function. The first one works; the second doesn't.
In [41]: df.groupby('C')['B'].transform('mean')
Out[41]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 5.670891
6 5.335332
7 0.580197
8 5.670891
9 5.670891
10 1.628290
11 1.628290
12 5.670891
13 8.493416
14 5.670891
15 8.493416
16 5.335332
17 5.670891
18 5.670891
19 5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
3061
3062 result.name = self._selected_obj.name
-> 3063 result.index = self._selected_obj.index
3064 return result
3065
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
3092 try:
3093 object.__getattribute__(self, name)
-> 3094 return object.__setattr__(self, name, value)
3095 except AttributeError:
3096 pass
pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
306 object.__setattr__(self, '_index', labels)
307 if not fastpath:
--> 308 self._data.set_axis(axis, labels)
309
310 def _set_subtyp(self, is_all_dates):
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
2834 raise ValueError('Length mismatch: Expected axis has %d elements, '
2835 'new values have %d elements' %
-> 2836 (old_len, new_len))
2837
2838 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements
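The 15 versus 20 in the error message is consistent with `groupby` dropping rows whose key is `NaN` (its default behaviour): 5 of the 20 rows have a missing `C`, so only 15 rows belong to any group. A minimal sketch to check this, using deterministic stand-in data (an assumption, not the random data above):

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 20 rows, keys 0..4, first five keys missing.
df = pd.DataFrame({
    "B": np.arange(20, dtype=float),
    "C": np.arange(20) % 5.0,
})
df.loc[:4, "C"] = np.nan  # label-based slice, so rows 0..4 inclusive

grouped = df.groupby("C")["B"]

# Rows with a NaN key are not assigned to any group by default,
# so the group sizes add up to fewer rows than the frame has.
rows_in_groups = int(grouped.size().sum())
print(rows_in_groups, len(df))
```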
The first one, using 'mean', is what I was expecting. It looks strange to me that there are two different behaviours for the same operation.
Note: The second one, with the lambda function, used to work on pandas version 0.19.1.
I first posted this question to SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I started to think this is a bug.
Thanks
commit: None python: 3.6.1.final.0 python-bits: 64 OS: Linux OS-release: 3.16.0-38-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 27.2.0 Cython: None numpy: 1.12.1 scipy: None xarray: None IPython: 6.1.0 sphinx: None patsy: None dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None sqlalchemy: None pymysql: None psycopg2: None jinja2: None s3fs: None pandas_gbq: None pandas_datareader: None
Top GitHub Comments
Edit: This was because there were None values in the grouping column, `groups`. Filling them first with dummies gets around the issue.

I encounter this when trying to fill missing values per group:

This looks to work on master now. Could use a test.
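The workaround mentioned in the comments, filling the missing group keys with a dummy value before grouping, could look roughly like the sketch below. The sentinel `-1`, the seeded generator and the final masking step are assumptions for illustration; they are not taken from the issue.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df = pd.DataFrame({
    "A": rng.random(20),
    "B": rng.random(20) * 10,
    "C": rng.integers(0, 5, 20).astype(float),
})
df.loc[:4, "C"] = np.nan

SENTINEL = -1  # hypothetical dummy key; must not collide with a real key

# Group on a filled copy of the key so every row belongs to some group.
keys = df["C"].fillna(SENTINEL)
result = df.groupby(keys)["B"].transform(lambda x: x.mean())

# Blank out the rows whose key was originally missing, so the output
# has NaN there, as in the 'mean' output shown in the issue.
result = result.where(df["C"].notna())
print(result)
```

Grouping on a separate `keys` Series leaves `df['C']` untouched, so the original missing values remain available for the masking step.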