Using groupby with custom index


Hello,

I have 6-hourly data (ERA Interim) for around 10 years. I want to calculate the annual 6-hourly climatology, i.e., 366*4 values, each corresponding to one 6-hour interval of the year. I am chunking the data along longitude. I’m using xarray 0.9.1 with Python 3.6 (Anaconda).

For a daily climatology on this data, I do the usual:

mean = data.groupby('time.dayofyear').mean(dim='time').compute()

For the 6 hourly version, I am trying the following:

test = (data['time.hour']/24 + data['time.dayofyear'])
test.name = 'dayHourly'
new_test = data.groupby(test).mean(dim='time').compute()

The first one (daily climatology) takes around 15 minutes on my data, whereas the second ran for almost 30 minutes before I gave up and killed the process.

Is there some obvious reason why the first is much faster than the second? data in both cases is the 6 hourly dataset. And is there an alternative way of expressing this computation which would make it faster?

TIA, Joy

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
shoyer commented, Mar 14, 2017

We currently do all the groupby handling ourselves, which means that when you group over smaller units, the dask graph gets bigger and each task gets smaller. Given that each chunk in the grouped data is only about 250,000 elements, it’s not surprising that things get a bit slower: that’s near the point where Python overhead starts to become significant.
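To make the "more groups, smaller tasks" point concrete, here is a rough sketch that only counts groups on a 6-hourly time axis. It uses the standard library rather than xarray, and the 2000-2009 range is an assumed stand-in for the "around 10 years" in the question, not the actual dataset.

```python
from collections import Counter
from datetime import datetime, timedelta


def six_hourly_times(start_year, end_year):
    """Yield 6-hourly timestamps covering start_year..end_year inclusive."""
    t = datetime(start_year, 1, 1)
    stop = datetime(end_year + 1, 1, 1)
    step = timedelta(hours=6)
    while t < stop:
        yield t
        t += step


# Assumed period; the original post only says "around 10 years".
times = list(six_hourly_times(2000, 2009))

# Key used for the daily climatology: day of year.
daily_groups = Counter(t.timetuple().tm_yday for t in times)

# Key equivalent to hour/24 + dayofyear: one group per (day of year, hour).
six_hourly_groups = Counter((t.timetuple().tm_yday, t.hour) for t in times)

print(len(daily_groups), "daily groups,", daily_groups[1], "timesteps in day 1")
print(len(six_hourly_groups), "6-hourly groups,",
      six_hourly_groups[(1, 0)], "timesteps in (day 1, 00h)")
```

Under these assumptions the 6-hourly key produces four times as many groups, each holding a quarter as many timesteps, which is consistent with a larger graph made of smaller tasks.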

It would be useful to benchmark graph creation and execution separately (especially using dask-distributed’s profiling tools) to understand where the slow-down is.
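One way to separate the two phases is a small timing helper like the sketch below. The helper name time_phases and the stand-in workload are hypothetical and exist only so the example runs without xarray or dask; it does not use dask-distributed’s profiling tools.

```python
import time


def time_phases(build, execute):
    """Time lazy construction and execution separately (hypothetical helper)."""
    t0 = time.perf_counter()
    lazy = build()          # construct the lazy object, no computation yet
    t1 = time.perf_counter()
    result = execute(lazy)  # force the computation
    t2 = time.perf_counter()
    return result, {"build_s": t1 - t0, "execute_s": t2 - t1}


# Stand-in workload: a generator is "built" lazily, then consumed by sum().
result, timings = time_phases(
    build=lambda: (x * x for x in range(100_000)),
    execute=lambda lazy: sum(lazy),
)
print(timings)
```

For the case in the question, build would presumably be lambda: data.groupby(test).mean(dim='time') and execute would be lambda lazy: lazy.compute(), so that graph creation and graph execution are reported as separate numbers.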

One thing that might help quite a bit in cases like this where the individual groups are small is to rewrite xarray’s groupby to do some groupby operations inside dask, rather than in a loop outside of dask. That would allow executing tasks on bigger chunks of arrays at once, which could significantly reduce scheduler overhead.
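As a loose analogy only (this is plain Python, not how xarray or dask implement groupby), the difference between looping over groups and handling a whole chunk at once can be sketched as follows. Both function names are made up for illustration.

```python
from collections import defaultdict


def mean_loop_per_group(keys, values):
    """One selection and one reduction per group, like a loop over groups."""
    out = {}
    for k in set(keys):
        members = [v for kk, v in zip(keys, values) if kk == k]
        out[k] = sum(members) / len(members)
    return out


def mean_single_pass(keys, values):
    """One sweep over the whole chunk, accumulating every group at once."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Toy data: four groups (think 00h, 06h, 12h, 18h) repeated three times.
keys = [0, 1, 2, 3] * 3
values = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0, 5.0, 6.0, 7.0, 8.0]

per_group = mean_loop_per_group(keys, values)
single = mean_single_pass(keys, values)
print(per_group == single, single)
```

The per-group version does work proportional to the number of groups times the data size, while the single pass touches each element once. The suggestion above is similar in spirit: fewer, larger units of work mean less per-task overhead.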

2 reactions
rabernat commented, Mar 14, 2017

Slightly OT observation: Performance issues are increasingly being raised here (see also #1301). Wouldn’t it be great if we had a shared space somewhere in the cloud to host these big-ish datasets and run performance benchmarks in a controlled environment?


