DataFrame groupby is extremely slow when grouping by a column of pandas Period values
Steps to reproduce
In [1]: import pandas
In [2]: import datetime
In [3]: months = [(2017, month) for month in range(1, 11)] * 10000
In [5]: month_pydates = pandas.Series([datetime.date(year, month, 1) for year, month in months])
In [7]: df = pandas.DataFrame({
    'x': list(range(len(months))),
    'month_periods': pandas.to_datetime(month_pydates).dt.to_period('M'),
    'month_pydates': month_pydates,
    'month_int': [year * 100 + month for year, month in months]})
In [8]: df.head()
Out[8]:
   month_int month_periods month_pydates  x
0     201701       2017-01    2017-01-01  0
1     201702       2017-02    2017-02-01  1
2     201703       2017-03    2017-03-01  2
3     201704       2017-04    2017-04-01  3
4     201705       2017-05    2017-05-01  4
In [9]: df.dtypes
Out[9]:
month_int         int64
month_periods    object
month_pydates    object
x                 int64
dtype: object
In [9]: df.loc[0, 'month_periods']
Out[9]: Period('2017-01', 'M')
In [10]: %timeit df.groupby('month_int')['x'].sum()
100 loops, best of 3: 2.32 ms per loop
In [11]: %timeit df.groupby('month_pydates')['x'].sum()
100 loops, best of 3: 6.7 ms per loop
In [12]: %timeit df.groupby('month_periods')['x'].sum()
1 loop, best of 3: 2.37 s per loop
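For readers without IPython, the same comparison can be sketched as a plain script. This is only an approximation of the session above: the smaller frame (10,000 rows), the `timings` dictionary, and the use of `timeit.repeat` are choices made here, not part of the original report. Absolute numbers depend on the machine and the pandas version, so the ratio between columns may differ from the one reported.

```python
import datetime
import timeit

import pandas as pd

# Ten months of 2017, repeated; a smaller frame than in the report (assumption).
months = [(2017, month) for month in range(1, 11)] * 1000
month_pydates = pd.Series([datetime.date(year, month, 1) for year, month in months])

df = pd.DataFrame({
    "x": list(range(len(months))),
    "month_periods": pd.to_datetime(month_pydates).dt.to_period("M"),
    "month_pydates": month_pydates,
    "month_int": [year * 100 + month for year, month in months],
})

# Best-of-three wall-clock time for one groupby-sum per key column.
timings = {}
for col in ["month_int", "month_pydates", "month_periods"]:
    best = min(timeit.repeat(lambda: df.groupby(col)["x"].sum(), number=1, repeat=3))
    timings[col] = best
    print("{:15s} {:.2f} ms".format(col, best * 1000))
```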
Problem description
When a DataFrame column contains pandas.Period values, and the user attempts to groupby this column, the resulting operation is very, very slow, when compared to grouping by columns of integers or by columns of Python objects.
In the example above, a DataFrame with 100,000 rows is created, and a groupby operation is performed on three columns. On the integer column, the groupby-sum took 2.3 milliseconds; on the column containing datetime.date objects, the groupby-sum took 6.7 milliseconds; and on the column containing pandas.Period objects, the groupby-sum took 2.4 seconds.
Note that in this case, the dtype of the 'month_periods' column is object. I attempted to convert this column to a period-specific data type using df['month_periods'].astype('period[M]'), but this led to a TypeError: data type "period[M]" not understood.
In any case, the series was returned by .dt.to_period('M'), so I would expect this to be a well-formed series of periods.
Expected Behavior
When grouping on a period column, it should be possible to group by the underlying integer values used for storing periods, and thus the performance should roughly match the performance of grouping by integers.
In the worst case, the performance should match the performance of comparing small Python objects (i.e. those with trivial __eq__ functions).
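The idea of grouping on the underlying integers can be sketched by hand. The example below is a minimal illustration, not a description of how pandas implements groupby: it maps each Period to its `ordinal` attribute, groups on those integers, and then rebuilds Period labels for the result. The tiny frame and the variable names are made up for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2017-01-01", "2017-02-01", "2017-01-15", "2017-03-01"]))
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "month_periods": dates.dt.to_period("M"),
})

# Each Period carries an integer ordinal; group on that instead of the objects.
ordinals = df["month_periods"].map(lambda p: p.ordinal)
sums = df["x"].groupby(ordinals).sum()

# Turn the integer group keys back into monthly Periods for readability.
sums.index = pd.PeriodIndex(
    [pd.Period(ordinal=int(o), freq="M") for o in sums.index],
    name="month_periods",
)
print(sums)
```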
Workaround
Making the column categorical avoids the performance hit, and roughly matches the integer column performance:
In [21]: df['month_periods'] = df['month_periods'].astype('category')
In [22]: %timeit df.groupby('month_periods')['x'].sum()
100 loops, best of 3: 1.97 ms per loop
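As a sanity check, the categorical workaround should produce the same sums as grouping on the original column. The sketch below compares the two on a made-up four-row frame. The `observed=True` argument is an addition for newer pandas versions, where it restricts the result to categories that actually occur; it is not part of the original report and is not available in pandas 0.19.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2017-01-01", "2017-02-01", "2017-01-15", "2017-03-01"]))
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "month_periods": dates.dt.to_period("M"),
})

# Group on the period column directly.
plain = df.groupby("month_periods")["x"].sum()

# Group on a categorical copy of the same column (the workaround).
df_cat = df.assign(month_periods=df["month_periods"].astype("category"))
via_category = df_cat.groupby("month_periods", observed=True)["x"].sum()

# Both approaches should agree on the group labels and the sums.
same_labels = [str(p) for p in plain.index] == [str(p) for p in via_category.index]
same_sums = plain.tolist() == via_category.tolist()
print(same_labels, same_sums)
```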
Output of pd.show_versions()
In [16]: pandas.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
Top GitHub Comments
@jreback, it is fine that a series of pandas Periods has dtype object. But grouping by pandas.Period objects is about 300 times slower than grouping by other series with dtype: object, such as series of datetime.date objects or simple tuples. (I’m comparing 2.4 seconds to about 7 milliseconds; see the second timing invocation in the original report, or the example below.)
That’s great news, thank you very much!