
Feature Request: Dask Dataframe to Support Transform Method

See original GitHub issue

It would be great if Dask DataFrame supported a transform method similar to the pandas transform.

It is very convenient to split-apply and then transform, which can be used, for example, to append the group means (or other transformed values) to every row and return a single dataframe.

Here is a good example of its use case: http://pbpython.com/pandas_transform.html
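
For reference, a minimal sketch of the pandas behavior being requested (made-up data, not from the issue):

import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1.0, 3.0, 5.0],
})

# transform returns a result aligned with the original index, so the
# group mean can be appended directly as a new column
df["group_mean"] = df.groupby("group")["value"].transform("mean")
print(df)
#   group  value  group_mean
# 0     a    1.0         2.0
# 1     a    3.0         2.0
# 2     b    5.0         5.0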

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 8
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

3 reactions
carlosdanielcsantos commented, Jul 4, 2018

I am trying to emulate the transform behavior using map, but I keep hitting a wall.

Goal: create a column with the difference between weight and the mean weight of the corresponding height group. First we need to assign each observation the mean weight of its group.

TL;DR: what can I do to avoid “ValueError: Not all divisions are known, can’t align partitions. Please use set_index to set the index.” when using map_partitions?

The dataset

df.head()

   index  height    weight
0      0      65  112.9925
1      1      71  136.4873
2      2      69  153.0269
3      3      68  142.3354
4      4      67  144.2971

Note: the computation is triggered by calling head().
Note: the divisions were defined by calling set_index('index').

Approach 1 (simplest):

Calling map directly doesn’t work:

>> mean_weight_for_height = df.groupby('height')['weight'].mean()
>> df['height'].map(mean_weight_for_height).head()

TypeError: arg must be pandas.Series, dict or callable. Got <class 'dask.dataframe.core.Series'>

But it works if we compute the groupby aggregator first:

>> df['height'].map(mean_weight_for_height.compute()).head()

0    119.908594
1    137.759168
2    132.019763
3    128.383103
4    125.515319
Name: height, dtype: float64

However, this is not what we want because Dask should allow us to defer the graph computation until the last possible moment.
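
That said, if eagerly computing just the small aggregate is acceptable, everything built on top of it can still stay lazy. A minimal sketch using the variables above:

mapping = mean_weight_for_height.compute()  # small pandas Series: one row per distinct height
mean_col = df['height'].map(mapping)        # still a lazy dask Series
diff = df['weight'] - mean_col              # also lazy; nothing is computed yet
diff.head()                                 # the graph only runs here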

Approach 2 (map_partitions + map):

This approach is curious, because it works if npartitions=1:

>> df['height'].map_partitions(lambda x, _map: x.map(_map), mean_weight_for_height, meta=(str, float)).head()

0    119.908594
1    137.759168
2    132.019763
3    128.383103
4    125.515319
Name: height, dtype: float64

But fails with more partitions:

ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

I’m too ignorant of Dask internals to understand what’s happening, but I believe the divisions of df are known, and the map operation changes neither the index of each partition nor its size!
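
One way to sidestep the alignment check (again at the cost of computing the small aggregate eagerly) is to close over a plain pandas object instead of passing a dask Series as an argument to map_partitions. A sketch:

mapping = mean_weight_for_height.compute()  # plain pandas Series, so no dask collections need aligning
df['height'].map_partitions(lambda s: s.map(mapping), meta=('height', 'f8')).head()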

Approach 3 (apply + loc):

Yet another option, though possibly a much heavier one. It doesn’t work either, because Dask doesn’t seem to propagate the computation up to the last node:

>> df['height'].apply(lambda x: mean_weight_for_height.loc[x].max(), meta=(str, float)).head()

0    dd.Scalar<series-..., dtype=float64>
1    dd.Scalar<series-..., dtype=float64>
2    dd.Scalar<series-..., dtype=float64>
3    dd.Scalar<series-..., dtype=float64>
4    dd.Scalar<series-..., dtype=float64>
Name: height, dtype: object

Note: the max() after loc is needed because dd.Series.loc returns a dd.Series instead of a value, but we know beforehand that each key is unique (a possible optimization here?).
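
A fully lazy alternative that avoids map entirely is to merge the aggregate back onto the frame on the grouping key. A sketch of this merge-based workaround (not taken from the thread):

means = mean_weight_for_height.to_frame('mean_weight').reset_index()  # columns: height, mean_weight
joined = df.merge(means, on='height', how='left')                     # lazy join on the grouping key
joined['weight_diff'] = joined['weight'] - joined['mean_weight']
joined.head()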

2 reactions
onacrame commented, Apr 14, 2018

Any luck on this? I have a click log consisting of 200 million rows, and I’m attempting to create features by frequency. For instance, if I have ip and channel as features, I want to add an ip_channel_frequency feature to the same dataframe.

Input:

ip   channel
123  12
123  12
145  49

Should look like:

ip   channel  frequency
123  12       2
123  12       2
145  49       1

df['ip_channel_frequency'] = df.groupby(['ip', 'channel'])['ip'].transform('size')

This doesn’t work because the transform method is not implemented in Dask.
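
A merge-based workaround along the lines sketched above can build the same frequency feature lazily in Dask. A sketch, with ddf standing in for a hypothetical dask dataframe:

counts = ddf.groupby(['ip', 'channel']).size().to_frame('ip_channel_frequency').reset_index()
ddf = ddf.merge(counts, on=['ip', 'channel'], how='left')  # appends the per-group row count
ddf.head()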
