DataFrame groupby is extremely slow when grouping by a column of pandas Period values


Steps to reproduce

In [1]: import pandas

In [2]: import datetime

In [3]: months = [(2017, month) for month in range(1, 11)] * 10000

In [5]: month_pydates = pandas.Series([datetime.date(year, month, 1) for year, month in months])

In [7]: df = pandas.DataFrame({
    'x': list(range(len(months))),
    'month_periods': pandas.to_datetime(month_pydates).dt.to_period('M'),
    'month_pydates': month_pydates,
    'month_int': [year * 100 + month for year, month in months]})

In [8]: df.head()
Out[8]:
   month_int month_periods month_pydates  x
0     201701       2017-01    2017-01-01  0
1     201702       2017-02    2017-02-01  1
2     201703       2017-03    2017-03-01  2
3     201704       2017-04    2017-04-01  3
4     201705       2017-05    2017-05-01  4

In [9]: df.dtypes
Out[9]:
month_int         int64
month_periods    object
month_pydates    object
x                 int64
dtype: object

In [9]: df.loc[0, 'month_periods']
Out[9]: Period('2017-01', 'M')

In [10]: %timeit  df.groupby('month_int')['x'].sum()
100 loops, best of 3: 2.32 ms per loop

In [11]: %timeit  df.groupby('month_pydates')['x'].sum()
100 loops, best of 3: 6.7 ms per loop

In [12]: %timeit  df.groupby('month_periods')['x'].sum()
1 loop, best of 3: 2.37 s per loop

Problem description

When a DataFrame column contains pandas.Period values and the user groups by that column, the operation is roughly a thousand times slower than grouping by a column of integers, and several hundred times slower than grouping by a column of Python objects.

In the example above, a DataFrame with 100,000 rows (10 months repeated 10,000 times) is created, and a groupby-sum is timed on three columns. On the integer column, the groupby-sum took 2.3 milliseconds; on the column containing datetime.date objects, it took 6.7 milliseconds; and on the column containing pandas.Period objects, it took 2.4 seconds.
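For readers who want to rerun the comparison outside IPython, the following is a minimal sketch of the same benchmark as a plain script. The helper name time_groupby and the reduced row count (10,000 instead of the original size) are assumptions made here to keep the run short; absolute timings will differ by machine and pandas version.

```python
import datetime
import timeit

import pandas as pd

# Smaller than the original report so the script finishes quickly.
months = [(2017, month) for month in range(1, 11)] * 1000
month_pydates = pd.Series([datetime.date(y, m, 1) for y, m in months])

df = pd.DataFrame({
    "x": list(range(len(months))),
    "month_periods": pd.to_datetime(month_pydates).dt.to_period("M"),
    "month_pydates": month_pydates,
    "month_int": [y * 100 + m for y, m in months],
})


def time_groupby(column, repeats=3):
    """Hypothetical helper: best wall-clock time in seconds of one groupby-sum."""
    timer = timeit.Timer(lambda: df.groupby(column)["x"].sum())
    return min(timer.repeat(repeat=repeats, number=1))


columns = ("month_int", "month_pydates", "month_periods")
timings = {col: time_groupby(col) for col in columns}
for col, seconds in timings.items():
    print(f"{col:>14}: {seconds * 1000:.2f} ms")
```

Taking the minimum over a few repeats mirrors the "best of 3" figure that %timeit reports.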

Note that in this case, the dtype of the 'month_periods' column is object. I attempted to convert this column to a period-specific data type using df['month_periods'].astype('period[M]'), but this led to a TypeError: data type "period[M]" not understood.

In any case, the series was returned by .dt.to_period('M'), so I would expect this to be a well-formed series of periods.
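A quick way to see what .dt.to_period('M') actually hands back is to inspect both the column dtype and the type of one element. This is only a sketch; the dtype it prints depends on the installed pandas version (the report above shows object on 0.19.1, and other releases may report a dedicated period dtype instead).

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2017-01-01", "2017-02-01", "2017-01-01"]))
periods = dates.dt.to_period("M")

# The column dtype varies by pandas version; the elements are Period objects either way.
print("column dtype:", periods.dtype)
print("element type:", type(periods.iloc[0]).__name__)

stored_as_object = periods.dtype == object
print("stored as generic Python objects:", stored_as_object)
```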

Expected Behavior

When grouping on a period column, it should be possible to group by the underlying integer values used for storing periods, and thus the performance should roughly match the performance of grouping by integers.

In the worst case, the performance should match the performance of comparing small Python objects (i.e. those with trivial __eq__ functions).
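The idea of grouping on the underlying integers can be approximated by hand. The sketch below is not how pandas implements groupby; it simply extracts Period.ordinal for each value, groups by that integer key, and converts the result index back to periods. The tiny six-row frame is made up for illustration.

```python
import pandas as pd

# Object-dtype column of Period values, as in the report.
periods = [pd.Period("2017-01", "M"), pd.Period("2017-02", "M")] * 3
df = pd.DataFrame({
    "x": range(6),
    "month_periods": pd.Series(periods, dtype=object),
})

# Integer key: the ordinal that each monthly Period wraps.
ordinals = df["month_periods"].map(lambda p: p.ordinal)
sums_by_ordinal = df.groupby(ordinals)["x"].sum()

# Map the integer index back to Period labels for readability.
sums_by_ordinal.index = pd.PeriodIndex(
    [pd.Period(ordinal=int(o), freq="M") for o in sums_by_ordinal.index],
    name="month_periods",
)
print(sums_by_ordinal)
```

The per-element map still costs one Python call per row, so this is an illustration of the expected behavior rather than a tuned workaround.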

Workaround

Making the column categorical avoids the performance hit, and roughly matches the integer column performance:

In [21]: df['month_periods'] = df['month_periods'].astype('category')

In [22]: %timeit  df.groupby('month_periods')['x'].sum()
100 loops, best of 3: 1.97 ms per loop
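As a sanity check on the workaround, the sketch below confirms on a made-up six-row frame that grouping by the categorical copy gives the same sums as grouping by the original period column. The observed=True argument is an addition here, not part of the original session; it exists only in later pandas releases (0.23 and newer, to my knowledge) and keeps the result limited to categories that actually occur.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2017-01-01", "2017-02-01"] * 3))
df = pd.DataFrame({"x": range(6), "month_periods": dates.dt.to_period("M")})

# Baseline: group directly on the period column.
plain = df.groupby("month_periods")["x"].sum()

# Workaround: group on a categorical copy of the same column.
df["month_cat"] = df["month_periods"].astype("category")
categorical = df.groupby("month_cat", observed=True)["x"].sum()

print(plain.tolist(), categorical.tolist())
```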

Output of pd.show_versions()


In [16]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

nmusolino commented, Oct 31, 2017 (1 reaction)

@jreback, it is fine that a series of pandas Periods has dtype object.

But grouping by pandas.Period objects is about 300 times slower than grouping by other series with dtype: object, such as series of datetime.date objects or simple tuples. (I’m comparing 2.4 seconds to about 7 milliseconds; see the second timing invocation in the original report, or the example below.)

In [25]: df['month_tuples'] = months

In [26]: df[['month_tuples', 'month_periods']].head()
Out[26]:
  month_tuples month_periods
0    (2017, 1)       2017-01
1    (2017, 2)       2017-02
2    (2017, 3)       2017-03
3    (2017, 4)       2017-04
4    (2017, 5)       2017-05

In [27]: %timeit  df.groupby('month_tuples')['x'].sum()
100 loops, best of 3: 7.18 ms per loop

In [28]: %timeit  df.groupby('month_periods')['x'].sum()
1 loop, best of 3: 2.36 s per loop

In [29]: df[['month_tuples', 'month_periods']].dtypes
Out[29]:
month_tuples     object
month_periods    object
dtype: object

nmusolino commented, Oct 25, 2018 (0 reactions)

That’s great news, thank you very much!
