Aggregate mean over TimeDelta column
See original GitHub issueCannot compute mean of timedelta in gropuby
In [1]: import pandas as pd
...: import numpy as np
...: print(pd.__version__)
...:
...: time = pd.to_timedelta(np.random.randint(100,1000, size=10), unit='s')
...: cat = np.random.choice(['A','B','C'], size=10)
...:
...: df = pd.DataFrame(dict(time=time, cat=cat))
...:
...: print(df['time'].mean())
...:
...: df.groupby('cat')['time'].mean()
...:
0.20.3
0 days 00:07:15.800000
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-1-b649ebd78333> in <module>()
10 print(df['time'].mean())
11
---> 12 df.groupby('cat')['time'].mean()
/Users/adefusco/Applications/anaconda3/4.3/envs/module-pandas/lib/python3.6/site-packages/pandas/core/groupby.py in mean(self, *args, **kwargs)
1035 nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
1036 try:
-> 1037 return self._cython_agg_general('mean', **kwargs)
1038 except GroupByError:
1039 raise
/Users/adefusco/Applications/anaconda3/4.3/envs/module-pandas/lib/python3.6/site-packages/pandas/core/groupby.py in _cython_agg_general(self, how, alt, numeric_only)
834
835 if len(output) == 0:
--> 836 raise DataError('No numeric types to aggregate')
837
838 return self._wrap_aggregated_output(output, names)
DataError: No numeric types to aggregate
Problem description
The mean of TimeDelta
can be computed, but it is not working in GroupBy.
Expected Output
This is the output I would expect.
In [3]: def average_time(x):
...: s = x.dt.seconds
...: return pd.Timedelta(s.mean(), unit='s')
...:
...: df.groupby('cat')['time'].apply(average_time)
...:
Out[3]:
cat
A 00:09:27.666667
B 00:06:18.666667
C 00:06:19.750000
Name: time, dtype: timedelta64[ns]
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 27.2.0 Cython: None numpy: 1.13.1 scipy: 0.19.1 xarray: None IPython: 6.0.0 sphinx: None patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: None tables: 3.4.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.1 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: None lxml: None bs4: None html5lib: 0.999 sqlalchemy: 1.1.9 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: 0.2.1
Issue Analytics
- State:
- Created 6 years ago
- Reactions:5
- Comments:9 (4 by maintainers)
Top GitHub Comments
The cython path will handle timedeltas as below
Arguably not a great API - we automatically exclude “non-numeric” columns on groupby ops. This makes sense with an entire DataFrame, but perhaps on a Series selection this should default to true?
I think this is just because we don’t have a cython implementation of
group_mean
in https://github.com/pandas-dev/pandas/blob/ad7d6fc0248edaf098537e5674dcc0c9dd059491/pandas/_libs/groupby_helper.pxi.in#L31 if you’re interested in taking a lookFYI, a slightly smaller workaround is
df.groupby('cat')['time'].agg(lambda x: x.mean())