
PERF: groupby-fillna perf, implement in cython

See original GitHub issue

Hello,

I work with a dataframe that has a multi-index (Date, InputTime), and this dataframe may contain NaN values in its columns (Value, Id). I want to forward-fill the values, but by Date only, and I can't find any way to do this efficiently. I'm using pandas 0.16.2 and numpy 1.9.2.

Here is the type of dataframe I have:

[Image: example dataframe]

And here is the result I want:

[Image: expected result]

So, to properly fill forward by Date, I can use the groupby(level=0) function. The groupby call itself is fast, but the forward fill applied to the grouped dataframe is far too slow.

Here is the code I use to compare a simple forward fill (which doesn't give the expected result but runs very quickly) with the forward fill by Date (which gives the expected result but is far too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(100000):
    listdata.append((d.date(), d, 2 * i if i % 3 == 1 else np.nan, i if i % 3 == 1 else np.nan))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple fill forward on the index (fast, but fills across Dates)
df_simple = df.copy()
start = dt.datetime.now()
for col in df_simple.columns:
    df_simple[col] = df_simple[col].ffill()
end = dt.datetime.now()
print("Time to fill forward on index = " + str((end - start).total_seconds()) + " s")

# Fill forward within each Date (first level of the index)
df_bydate = df.copy()
start = dt.datetime.now()
for col in df_bydate.columns:
    df_bydate[col] = df_bydate[col].groupby(level=0).ffill()
end = dt.datetime.now()
print("Time to fill forward on Date only = " + str((end - start).total_seconds()) + " s")

Here are the time results I have:

[Image: timing results]

So, the forward fill on the grouped dataframe is 10,000 times slower than the simple forward fill. I cannot understand why pandas runs so slowly here. I need performance comparable to the simple forward fill, i.e. just a couple of milliseconds.

Could somebody address this performance issue, or suggest a way to do this kind of operation very efficiently?

Thanks
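
For reference, here is a minimal sketch (not from this thread) of the kind of vectorized workaround that avoids the per-group Python overhead. It assumes the rows of each Date group are contiguous in the index, as in the example above, and it uses .to_numpy(), which needs a more recent pandas than the 0.16.2 mentioned here; the group_ffill name is purely illustrative.

import numpy as np
import pandas as pd

def group_ffill(frame, level=0):
    # Forward-fill each column within groups defined by one index level,
    # assuming each group occupies a contiguous block of rows.
    codes = pd.factorize(frame.index.get_level_values(level))[0]
    n = len(frame)
    positions = np.arange(n)

    out = frame.copy()
    for col in frame.columns:
        values = frame[col].to_numpy(dtype=float)
        valid = ~np.isnan(values)
        # Position of the most recent non-NaN row, carried forward
        last_valid = np.where(valid, positions, -1)
        np.maximum.accumulate(last_valid, out=last_valid)
        # Only fill where such a row exists and it belongs to the same group
        fillable = (last_valid >= 0) & (codes[last_valid] == codes)
        filled = values.copy()
        filled[fillable] = values[last_valid[fillable]]
        out[col] = filled
    return out

Everything runs on flat numpy arrays, so the cost is a few array passes per column rather than one Python-level fill per group.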

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Reactions: 2
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
alonme commented, Mar 21, 2022

Hey @jreback - I found this issue through this SO answer: https://stackoverflow.com/a/43251227/7581507.

I suspected that the proposed optimization (based on your code, AFAIU) wouldn't make a big difference, as I see here that the related cythonization has already been added to the code.

However, using the discussed method I got a speedup from 8 minutes to ~2 seconds. Is it still expected to make such a big difference? Can we do anything to speed up groupby.ffill?
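
As a quick, hypothetical sanity check that a workaround of that kind matches the (slow) built-in path, one could compare it against groupby(level=0).ffill() on the example frame from the original report; this assumes the group_ffill sketch above and the df built earlier are in scope.

# The fast path should produce exactly the frame the slow path produces
expected = df.groupby(level=0).ffill()
result = group_ffill(df)
pd.testing.assert_frame_equal(result, expected)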

0 reactions
WillAyd commented, Feb 10, 2018

@jreback while I'm working on Cython optimizations I can take a look at this one. Just curious whether we view the fact that ffill and bfill retain the grouping in its own column as a feature, or as something up for discussion. To illustrate:

In []: df = pd.DataFrame({'key': ['a']*5, 'val': range(5)})
In []: df.groupby('key').rank()
Out []:
   val
0  1.0
1  2.0
2  3.0
3  4.0
4  5.0

In []: df.groupby('key').ffill()  # retains key in output; same for bfill
Out []:
  key  val
0   a    0
1   a    1
2   a    2
3   a    3
4   a    4

If bfill and ffill didn't return the grouping, in a fashion similar to rank, then we could leverage the same call signatures as the rest of the transformations.
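
As a small illustration (hedged: this behavior has shifted across pandas versions, and later releases stopped including the grouping column in groupby.ffill() output), selecting the value columns before filling already gives rank-like output either way:

In []: df = pd.DataFrame({'key': ['a'] * 5, 'val': [0, None, 2, None, 4]})
In []: df.groupby('key')[['val']].ffill()  # no 'key' column in the result
Out []:
   val
0  0.0
1  0.0
2  2.0
3  2.0
4  4.0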

Read more comments on GitHub >

Top Results From Across the Web

Improving the performance of pandas groupby - Stack Overflow
The trick is to avoid apply / transform in any form as much as possible. Avoid them like the plague. They're basically implemented...

Group by: split-apply-combine — pandas 1.5.2 documentation
We'll address each area of GroupBy functionality then provide some non-trivial examples / use cases. See the cookbook for some advanced strategies.

7.1 Cython (Writing C extensions for pandas) - GitHub Pages
This tutorial walks through a "typical" process of cythonizing a slow computation. We use an example from the cython documentation but in the...

Improving the performance of pandas groupby - Pandas, Python
In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side...

Dask Best Practices - Dask documentation
Use better algorithms or data structures: NumPy, pandas, Scikit-learn may have ... snappy, and Z-Standard that provide better performance and random access.
