Decimal fields dropped in group by with more than one column
Code Sample, a copy-pastable example if possible
import pandas as pd
import decimal
df = pd.DataFrame({'a': [decimal.Decimal('4.56')]*6, 'b': range(3, 6)*2, 'c': range(6)})
print df.groupby('b')['a'].sum()
print df.groupby('b')['a', 'c'].sum()
print df.groupby('b').agg({'a': 'sum', 'c': 'sum'})
Output:
b
3 9.12
4 9.12
5 9.12
Name: a, dtype: object
c
b
3 3
4 5
5 7
a c
b
3 9.12 3
4 9.12 5
5 9.12 7
Problem description
The aggregation over column a is silently dropped when it is selected from the groupby object together with another column, but it works when a is selected on its own or when the sums are requested through agg (I’m not sure if ‘sum’ is exactly equivalent to .sum() in the above).
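For readers on Python 3, the following is a minimal sketch of the same setup using the agg form that the report says keeps column a. It is a port of the sample above, not the reporter's original code: range(3, 6) is wrapped in list() because range objects cannot be multiplied in Python 3, and the name result is introduced here only for illustration.

```python
import decimal

import pandas as pd

# Python 3 port of the sample data from the report.
df = pd.DataFrame({
    "a": [decimal.Decimal("4.56")] * 6,
    "b": list(range(3, 6)) * 2,  # list() is needed on Python 3
    "c": range(6),
})

# Requesting the sums through agg keeps the Decimal column,
# as described in the report.
result = df.groupby("b").agg({"a": "sum", "c": "sum"})
print(result)
```

Whether the multi-column .sum() form also keeps column a depends on the pandas version in use, so this sketch relies only on the agg path.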
Expected Output
b
3 9.12
4 9.12
5 9.12
Name: a, dtype: object
a c
b
3 9.12 3
4 9.12 5
5 9.12 7
a c
b
3 9.12 3
4 9.12 5
5 9.12 7
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 2.7.15.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None
pandas: 0.23.4 pytest: None pip: 18.0 setuptools: 39.1.0 Cython: None numpy: 1.14.3 scipy: 1.1.0 pyarrow: 0.8.0 xarray: None IPython: 5.7.0 sphinx: None patsy: 0.5.0 dateutil: 2.7.2 pytz: 2018.4 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: 1.1.0 xlwt: None xlsxwriter: 1.0.4 lxml: None bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.7 pymysql: None psycopg2: 2.7.4 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
Top GitHub Comments
As a side note, we actually have this within our test suite, albeit in a fledgling state:
https://github.com/pandas-dev/pandas/blob/0370740034978d3a63d4b8e5e2c96ff54e7e08ba/pandas/tests/extension/decimal/array.py#L37-L118
This doesn’t quite get the job done though, as there’s currently no way to dynamically alter ExtensionBlock.is_numeric, which currently is always False, and is what get_numeric_data is ultimately looking at for DecimalArray.
Will open a separate issue for the above though, as it’s only tangentially related to this issue, and using DecimalArray would really only resolve this issue for the Decimal case (i.e. a more generic solution for object dtype might be nice?).
Another tactic would be to remove the get_numeric_data call and rely on the try/except later in the loop, which should get raised if there is no appropriate aggregation function.
I have tested this and it restores consistency between the two methods. Happy to PR if we can be confident this won’t break other logic.
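The effect of the numeric filtering discussed in these comments can be seen from the outside with a small sketch. It assumes that DataFrame._get_numeric_data, a private pandas helper, is a reasonable stand-in for the internal get_numeric_data call the comments refer to; it is not the groupby code path itself, and the name numeric_cols is introduced here only for illustration.

```python
import decimal

import pandas as pd

df = pd.DataFrame({
    "a": [decimal.Decimal("4.56")] * 6,
    "b": list(range(3, 6)) * 2,
    "c": range(6),
})

# A column of Decimal objects is stored with object dtype.
print(df.dtypes["a"])

# Private helper, used here only for illustration: it keeps the
# columns pandas regards as numeric, so the object column is left out.
numeric_cols = list(df._get_numeric_data().columns)
print(numeric_cols)
```

Because column a is stored as object dtype, a filter of this kind removes it before any aggregation is attempted, which is consistent with the dropped column in the report.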