Feature Request: Dask Dataframe to Support Transform Method
It would be great if Dask DataFrame supported a `transform` method similar to the pandas `transform`. It is very convenient to split-apply and then transform, which can be used, for example, to append the group means (or other transformations) to the original rows and return a single dataframe.
Here is a good example of its use case: http://pbpython.com/pandas_transform.html
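For context, a minimal pandas sketch of what `transform` provides (the column names here are illustrative, not from the issue), together with the call one would hope to be able to write on a Dask DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1.0, 3.0, 10.0]})

# pandas: broadcast each group's mean back onto every row of that group
df['group_mean'] = df.groupby('group')['value'].transform('mean')
#   group  value  group_mean
# 0     a    1.0         2.0
# 1     a    3.0         2.0
# 2     b   10.0        10.0

# The hoped-for Dask equivalent (not implemented when this issue was filed):
# import dask.dataframe as dd
# ddf = dd.from_pandas(df, npartitions=2)
# ddf['group_mean'] = ddf.groupby('group')['value'].transform('mean')
```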
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 8
- Comments: 11 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I am trying to emulate the `transform` behavior using `map`, but I always hit some wall.

Goal: create a column with the difference between `weight` and the mean of the corresponding `height` group. First we need to assign the mean weight of the group to each observation.

TL;DR: what can I do to avoid “ValueError: Not all divisions are known, can’t align partitions. Please use `set_index` to set the index.” when using `map_partitions`?

The dataset
Note: the computation is triggered by calling `head()`.
Note: the divisions were defined by calling `set_index('index')`.
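The original dataset snippet did not survive the copy; below is a minimal, hypothetical stand-in (the `weight` and `height` column names come from the goal above, the values are invented) that reproduces the setup described in the notes:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical data: weights to be compared against the mean of their height group.
pdf = pd.DataFrame({
    'index': range(8),
    'height': [150, 150, 160, 160, 170, 170, 170, 180],
    'weight': [55.0, 58.0, 60.0, 63.0, 70.0, 72.0, 75.0, 80.0],
})

# Two partitions; divisions defined by set_index('index'), as in the notes above.
df = dd.from_pandas(pdf, npartitions=2).set_index('index')
df.head()  # triggers the computation
```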
Approach 1 (simplest):
Calling `map` directly doesn’t work:
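The original code block is missing from the copy; a sketch of the kind of call that fails here, assuming the group means are built lazily and passed straight to `Series.map`:

```python
# Lazy group means: a dask Series indexed by height (a sketch, not the original code).
mean_weight = df.groupby('height')['weight'].mean()

# Mapping with a *lazy* dask Series is the step that reportedly fails here.
df['weight_diff'] = df['weight'] - df['height'].map(mean_weight)
df.head()
```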
But it works if we compute the groupby aggregator first:
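A sketch of the working variant, with the aggregation materialised as a pandas Series before mapping:

```python
# Materialise the group means eagerly; .map then receives a plain pandas Series.
mean_weight = df.groupby('height')['weight'].mean().compute()

df['weight_diff'] = df['weight'] - df['height'].map(mean_weight)
df.head()
```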
However, this is not what we want because Dask should allow us to defer the graph computation until the last possible moment.
Approach 2 (`map_partitions` + `map`):
This approach is curious, because it works if `npartitions=1`, but fails with more partitions:
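Again the original snippet is missing; a plausible reconstruction, assuming the lazy group means are passed as an extra argument to `map_partitions` and mapped inside each partition with plain pandas:

```python
# Lazy group means, handed to map_partitions so each partition can do a pandas .map.
mean_weight = df.groupby('height')['weight'].mean()

df['weight_diff'] = df.map_partitions(
    lambda part, means: part['weight'] - part['height'].map(means),
    mean_weight,
    meta=('weight_diff', 'f8'),
)
# With npartitions=1 this reportedly runs; with several partitions it raises the
# "Not all divisions are known, can't align partitions" ValueError quoted above.
df.head()
```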
I’m too ignorant about Dask internals to understand what’s happening, but I believe the partitions in `df` are known, and that the `map` operation changes neither the index in each partition nor its size!

Approach 3 (`apply` + `loc`):
Yet another option, but this time a possibly much heavier one. However, it doesn’t work, because Dask doesn’t seem to be propagating the computation up to the last node:
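The snippet is again missing from the copy; a sketch of what an `apply` + `loc` attempt of this shape might look like (the `meta` value is an assumption), doing a row-wise lookup into the lazy group means:

```python
mean_weight = df.groupby('height')['weight'].mean()  # still lazy

# Row-wise lookup of the group mean through .loc; .max() collapses the
# one-element dd.Series that .loc returns (see the note below). The lazy
# objects inside the lambda are what keeps this from computing as hoped.
df['weight_diff'] = df.apply(
    lambda row: row['weight'] - mean_weight.loc[row['height']].max(),
    axis=1,
    meta=('weight_diff', 'f8'),
)
df.head()
```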
Note: the `max()` after `loc` is needed because `dd.Series.loc` returns a `dd.Series` instead of a value, but we know beforehand that each key is unique (a possible optimization here?).

Any luck on this? I have a click log consisting of 200 million rows. I’m attempting to create features by frequency. So, for instance, if I have `ip` and `channel` as features, I’m attempting to add an `ip_channel_frequency` feature to the same dataframe.
ip   channel
123  12
123  12
145  49

Should look like:

ip   channel  frequency
123  12       2
123  12       2
145  49       1
df['ip_channel_frequency'] = df.groupby(['ip', 'channel']).ip.transform('size')
Doesn’t work due to the transform method not being implemented
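Not from the thread, but one possible workaround with the Dask API that does exist is to aggregate with `size()` and merge the counts back onto the original frame (column names follow the example above; `df` is assumed to be the dask DataFrame holding the click log):

```python
# Count the occurrences of each (ip, channel) pair, then broadcast the count back
# onto every row with a left merge -- a stand-in for groupby(...).transform('size').
counts = df.groupby(['ip', 'channel']).size().reset_index()
counts = counts.rename(columns={0: 'ip_channel_frequency'})

df = df.merge(counts, on=['ip', 'channel'], how='left')
```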