
Dask shuffle performance help

See original GitHub issue

Hi Folks,

I’m experimenting with a new shuffle algorithm for Dask dataframe. This is what backs distributed versions of join, set_index, groupby-apply, or anything that requires the large movement of rows around a distributed dataframe.

Things are coming along well, but I’m running into a performance challenge with pandas and I would like to solicit feedback here before diving more deeply. If this isn’t the correct venue please let me know and I’ll shift this elsewhere.

We’ve constructed a script (thanks @gjoseph92 for getting this started) that creates a random dataframe plus a column to split on, then splits, rearranges, serializes/deserializes, and concatenates the shards a couple of times. This is representative of the operations we’re trying to do, except that in practice the shards/groups in between a couple of those steps would be arriving from different machines, rather than being the same shards throughout.

import time
import random
import pickle

import numpy as np
import pandas as pd
    

# Parameters
n_groups = 10_000
n_cols = 1000
n_rows = 30_000
    
# Make input data
df = pd.DataFrame(np.random.random((n_rows, n_cols)))
df["partitions"] = (df[0] * n_groups).astype(int)  # random values 0..10000

start = time.time()
_, groups = zip(*df.groupby("partitions"))  # split into many small shards

groups = list(groups)
random.shuffle(groups)  # rearrange those shards

groups = [pickle.dumps(group) for group in groups]  # Simulate sending across the network
groups = [pickle.loads(group) for group in groups]

df = pd.concat(groups)  # reassemble shards
_, groups = zip(*df.groupby("partitions"))  # and resplit


stop = time.time()

import dask
print(dask.utils.format_bytes(df.memory_usage().sum() / (stop - start)), "/s")

  • With 10,000 groups I get around 40 MB/s bandwidth.
  • With 1,000 groups I get around 230 MB/s bandwidth.
  • With 200 or fewer groups I get around 500 MB/s bandwidth.

Obviously, one answer here is “use fewer groups; pandas isn’t designed to operate efficiently on only a few rows at a time”. That’s fair, and we’re trying to design for that, but there is always pressure to shrink partitions down, so I would like to explore whether there is anything we can do on the pandas side to add some tolerance here.
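
One concrete way to act on the “use fewer groups” suggestion on the Dask side is to hash the many logical partitions into a smaller number of coarse buckets and concatenate the shards headed for the same bucket before serializing, so pandas only ever handles a few larger frames (the receiver would then re-split by the real partition column). A minimal sketch, assuming the same df and "partitions" column as above; the bucket count and helper name are illustrative, not part of the original script:

import pandas as pd


def shard_into_buckets(df: pd.DataFrame, col: str, n_buckets: int = 200):
    """Split df into at most n_buckets coarse shards by bucketing the
    partition labels, so each serialized payload carries many small
    logical groups concatenated together."""
    buckets = df[col].to_numpy() % n_buckets  # coarse destination id
    return [shard for _, shard in df.groupby(buckets)]


# e.g. pack ~10,000 logical partitions into ~200 payloads before pickling:
# payloads = [pickle.dumps(shard) for shard in shard_into_buckets(df, "partitions")]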

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 18 (13 by maintainers)

Top GitHub Comments

1 reaction
jbrockmendel commented, Sep 6, 2021

Below I’m pasting an implementation of what I have in mind. It’s about 6x faster than pd.concat on the example in the OP. I’m going to ask you to take the baton to a) test it against a wide variety of DataFrames and b) profile it.

import numpy as np
import pandas as pd
import pandas._testing as tm


def concat_known_aligned(frames: list[pd.DataFrame]):
    """
    pd.concat(frames, axis=0) specialized to the case
    where we know that

    a) Columns are identical across frames.
    b) Underlying block layout is identical across frames.

    i.e. these frames are generated by something like

    frames = [df.iloc[i:i+100] for i in range(0, len(df), 100)]

    Notes
    -----
    The caller is responsible for checking these conditions.
    """
    if len(frames) == 0:
        raise ValueError("frames must be non-empty.")

    if frames[0].shape[1] == 0:
        # no columns, can use non-optimized concat cheaply
        return pd.concat(frames, axis=0, ignore_index=True)

    mgrs = [df._mgr for df in frames]
    first = mgrs[0]

    nbs = []
    for i, blk in enumerate(first.blocks):
        arr = blk.values
        arrays = [mgr.blocks[i].values for mgr in mgrs]

        if arr.ndim == 1:
            # i.e. is_1d_only_ea_dtype
            new_arr = arr._concat_same_type(arrays)

        elif not isinstance(arr, np.ndarray):
            new_arr = arr._concat_same_type(arrays, axis=1)

        else:
            new_arr = np.concatenate(arrays, axis=1)

        nb = type(blk)(new_arr, placement=blk.mgr_locs, ndim=2)
        nbs.append(nb)

    index = frames[0].index.append([x.index for x in frames[1:]])
    axes = [frames[0].columns, index]
    new_mgr = type(first)(nbs, axes)
    return pd.DataFrame(new_mgr)


def check_equivalent(frames):
    result = concat_known_aligned(frames)
    expected = pd.concat(frames, axis=0)
    tm.assert_frame_equal(result, expected)


def test():
    df = tm.makeMixedDataFrame()
    frames = [df[i:i+1] for i in range(len(df))]
    check_equivalent(frames)
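
For context, a quick harness to check the “about 6x faster” claim against the example in the OP could look like the following; it assumes concat_known_aligned from above is in scope, and the sizes mirror the original script (timings will of course vary by machine):

import time

import numpy as np
import pandas as pd

# Rebuild the shards the same way as the original script.
df = pd.DataFrame(np.random.random((30_000, 1000)))
df["partitions"] = (df[0] * 10_000).astype(int)
_, shards = zip(*df.groupby("partitions"))
shards = list(shards)

start = time.time()
pd.concat(shards)
print("pd.concat:            ", time.time() - start, "s")

start = time.time()
concat_known_aligned(shards)  # function defined above
print("concat_known_aligned: ", time.time() - start, "s")
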
0 reactions
jakirkham commented, Apr 12, 2022

When NumPy allocates memory, it registers those allocations with Python’s tracemalloc. The tracemalloc C API acquires and releases the GIL when registering an allocation. Since there is only one allocation in np.concatenate (it is deferred until the final array’s shape & type are known), there should be only one acquisition/release of the GIL. Likely this is what is being seen here.

If there are multiple calls to np.concatenate, there could be multiple GIL acquisitions/releases, which would be something to avoid (ideally by using as few calls to np.concatenate as possible).

np.concatenate supports an out argument, so one could try preallocating the result array, passing it to concatenate, and then profiling concatenate’s GIL usage again. Guessing nothing else will show up, but it would be interesting to see regardless.
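
For what it’s worth, the preallocated-output variant is roughly the following; the shapes here are illustrative (they mimic the ~3-row, 1000-column float blocks that the OP’s shards produce), and whether it actually changes the tracemalloc/GIL picture would need profiling to confirm:

import numpy as np

# Per-shard block values to combine (pandas blocks are laid out as columns x rows).
arrays = [np.random.random((1000, 3)) for _ in range(10_000)]

# Preallocate the result once and let np.concatenate write into it via out=.
total = sum(a.shape[1] for a in arrays)
out = np.empty((arrays[0].shape[0], total), dtype=arrays[0].dtype)
np.concatenate(arrays, axis=1, out=out)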

As the gist of concatenate is determining the resulting array’s metadata, allocating the result array, and copying data into it, one would expect the last step (copying) to be the most time-consuming. Features like hugepages can help improve performance (if available); it might be worth checking whether they are available and enabled on the machine used.

A separate benchmark of raw copying for your use case might provide additional insight here.
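
As a rough baseline for such a benchmark, one could measure plain copy bandwidth into a preallocated buffer of comparable size and compare it with the concatenate timings above (the array size here is illustrative, roughly matching the OP’s frame):

import time

import numpy as np

src = np.random.random((1000, 30_000))  # ~240 MB, about the size of the full float block
dst = np.empty_like(src)

start = time.time()
np.copyto(dst, src)
elapsed = time.time() - start
print(f"{src.nbytes / elapsed / 1e9:.2f} GB/s raw copy bandwidth")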

