
PERF: groupby-fillna perf, implement in cython

See original GitHub issue

Hello,

I work with a dataframe that has a multi-index (Date, InputTime), and this dataframe may contain NaN values in its columns (Value, Id). I want to forward-fill the values, but by Date only, and I can't find any way to do this efficiently. I'm using pandas 0.16.2 and numpy 1.9.2.

Here is the type of dataframe I have:

[Image: example dataframe]

And here is the result I want:

[Image: expected result]

So, to properly fill forward by Date, I can use the groupby(level=0) function. The groupby call itself is fast, but the forward fill applied to the grouped dataframe is far too slow.

Here is the code I use to compare a simple forward fill (which doesn't give the expected result but runs very quickly) with the forward fill by Date (which gives the expected result but is far too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(100000):
    listdata.append((d.date(), d, 2 * i if i % 3 == 1 else np.nan, i if i % 3 == 1 else np.nan))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple fill forward on the index (fast, but fills across Dates)
df_simple = df.copy()
start = dt.datetime.now()
for col in df_simple.columns:
    df_simple[col] = df_simple[col].ffill()
end = dt.datetime.now()
print("Time to fill forward on index = " + str((end - start).total_seconds()) + " s")

# Fill forward within each Date (first level of the index)
df_bydate = df.copy()
start = dt.datetime.now()
for col in df_bydate.columns:
    df_bydate[col] = df_bydate[col].groupby(level=0).ffill()
end = dt.datetime.now()
print("Time to fill forward on Date only = " + str((end - start).total_seconds()) + " s")

Here are the time results I have:

[Image: timing results]

So, the forward fill on the grouped dataframe is 10,000 times slower than the simple forward fill. I cannot understand why pandas runs so slowly here. I need performance comparable to the simple forward fill, i.e. just a couple of milliseconds.

Could somebody address this performance issue, or suggest a way to do this kind of operation very efficiently?

Thanks
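
For reference, here is a minimal sketch (not from this thread) of the kind of vectorized workaround that avoids the per-group Python overhead. It assumes the rows of each Date group are contiguous in the index, as in the example above, and it uses .to_numpy(), which needs a more recent pandas than the 0.16.2 mentioned here; the group_ffill name is purely illustrative.

import numpy as np
import pandas as pd

def group_ffill(frame, level=0):
    # Forward-fill each column within groups defined by one index level,
    # assuming each group occupies a contiguous block of rows.
    codes = pd.factorize(frame.index.get_level_values(level))[0]
    n = len(frame)
    positions = np.arange(n)

    out = frame.copy()
    for col in frame.columns:
        values = frame[col].to_numpy(dtype=float)
        valid = ~np.isnan(values)
        # Position of the most recent non-NaN row, carried forward
        last_valid = np.where(valid, positions, -1)
        np.maximum.accumulate(last_valid, out=last_valid)
        # Only fill where such a row exists and it belongs to the same group
        fillable = (last_valid >= 0) & (codes[last_valid] == codes)
        filled = values.copy()
        filled[fillable] = values[last_valid[fillable]]
        out[col] = filled
    return out

Everything runs on flat numpy arrays, so the cost is a few array passes per column rather than one Python-level fill per group.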

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 2
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
alonme commented, Mar 21, 2022

Hey @jreback - I found this issue through this SO answer: https://stackoverflow.com/a/43251227/7581507.

I suspected that the proposed optimization (based on your code, AFAIU) wouldn't make a big difference, as I see here that the related cythonization has already been added to the code.

However, using the discussed method I got a speedup from 8 minutes to ~2 seconds. Is it still expected to make such a big difference? Can we do anything to speed up groupby.ffill?
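
As a quick, hypothetical sanity check that a workaround of that kind matches the (slow) built-in path, one could compare it against groupby(level=0).ffill() on the example frame from the original report; this assumes the group_ffill sketch above and the df built earlier are in scope.

# The fast path should produce exactly the frame the slow path produces
expected = df.groupby(level=0).ffill()
result = group_ffill(df)
pd.testing.assert_frame_equal(result, expected)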

0 reactions
WillAyd commented, Feb 10, 2018

@jreback while I'm working on Cython optimizations I can take a look at this one. Just curious whether we view the fact that ffill and bfill retain the grouping in its own column as a feature, or as something up for discussion. To illustrate:

In []: df = pd.DataFrame({'key': ['a']*5, 'val': range(5)})
In []: df.groupby('key').rank()
Out []:
   val
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

In []: df.groupby('key').ffill()  # retains key in output; same for bfill
Out []:
  key  val
0   a    0
1   a    1
2   a    2
3   a    3
4   a    4

If bfill and ffill didn't return the grouping, in a fashion similar to rank, then we could leverage the same call signatures as the rest of the transformations.
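
As a small illustration (hedged: this behavior has shifted across pandas versions, and later releases stopped including the grouping column in groupby.ffill() output), selecting the value columns before filling already gives rank-like output either way:

In []: df = pd.DataFrame({'key': ['a'] * 5, 'val': [0, None, 2, None, 4]})
In []: df.groupby('key')[['val']].ffill()  # no 'key' column in the result
Out []:
   val
0  0.0
1  0.0
2  2.0
3  2.0
4  4.0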

Read more comments on GitHub >

Top Results From Across the Web

Improving the performance of pandas groupby - Stack Overflow
The trick is to avoid apply / transform in any form as much as possible. Avoid them like the plague. They're basically implemented...

Group by: split-apply-combine — pandas 1.5.2 documentation
We'll address each area of GroupBy functionality then provide some non-trivial examples / use cases. See the cookbook for some advanced strategies.

7.1 Cython (Writing C extensions for pandas) - GitHub Pages
This tutorial walks through a "typical" process of cythonizing a slow computation. We use an example from the cython documentation but in the...

Improving the performance of pandas groupby - Pandas, Python
In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side...

Dask Best Practices - Dask documentation
Use better algorithms or data structures: NumPy, pandas, Scikit-learn may have ... snappy, and Z-Standard that provide better performance and random access.
