Feature: Multiprocessing-based Backend using SharedMemory + Pickle 5 for 2-3x Faster IPC
Introduction
Hi all. I am writing a custom backend that uses multiprocessing processes, but for IPC it uses Python 3.8's SharedMemory together with out-of-band Pickle protocol 5. I am opening this issue to bring this work to your attention, since its improved IPC performance could benefit you, and to propose potentially merging it as a separate backend.
This new method of IPC is usually 2-3x faster than traditional Pipes.
Explanation
The purpose of SharedMemory (not file-based, but provided by the OS kernel) is that it is a much more efficient mechanism for sharing large amounts of data than a standard Pipe. It is also faster than mmap'd file-based sharing.
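As a minimal illustration (independent of the demo files below), a SharedMemory segment created by one process can be attached by name from another; for brevity this snippet keeps both handles in a single process:

```python
from multiprocessing.shared_memory import SharedMemory

# Create a 1 KiB kernel-backed shared segment and write into it
shm = SharedMemory(create=True, size=1024)
shm.buf[:5] = b"hello"

# A second handle (normally opened in another process) attaches by name
shm2 = SharedMemory(name=shm.name)
read_back = bytes(shm2.buf[:5])
print(read_back)  # b'hello'

# Every handle closes its view; exactly one handle unlinks the segment
shm2.close()
shm.close()
shm.unlink()
```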
The purpose of out-of-band Pickle protocol 5 is that it performs fewer copies of the data during pickling. dumps returns a list of buffers rather than one aggregated data buffer, avoiding the aggregating copy. loads then builds the Python object directly from the buffers without a copy of its own, effectively for free.
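For example (a small sketch, separate from the demo files), NumPy arrays support protocol 5: dumps hands the array's backing buffer to buffer_callback instead of copying it into the pickle stream, and loads rebuilds the array from the buffers supplied to it:

```python
import pickle
import numpy as np

arr = np.arange(10)

# With protocol 5, large contiguous buffers are passed to buffer_callback
# as PickleBuffer objects instead of being copied into the byte stream.
buffers = []
data = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# loads reconstructs the array directly from the supplied buffers.
arr2 = pickle.loads(data, buffers=buffers)
print(np.array_equal(arr, arr2))  # True
```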
Demo
To illustrate the benefits of this type of IPC, here is a simple program comparing SharedMemory + Pickle 5 against Pipes + regular Pickle. Note that the SharedMemory example does not set up a separate process, but the transfer itself would be unaffected.
When transferring a large Python object, in this case big_array = np.arange(5 * 10**7):
SharedMemory + Pickle 5 takes 0.6724 seconds
Pipes + regular Pickle takes 1.6099 seconds
Demo Files
SharedMemory:
from multiprocessing.shared_memory import SharedMemory
import time
import numpy as np
import pickle
import pickle_utils
import copy
def sender(obj):
    # Pickle the object using out-of-band buffers (pickle protocol 5)
    buffers = []
    data = pickle.dumps(
        obj,
        protocol=pickle.HIGHEST_PROTOCOL,
        buffer_callback=lambda b: buffers.append(b.raw()),
    )  # type: ignore
    # Pack the buffers to be written to memory
    data_sz, data_ls = pickle_utils.pack_frames([data] + buffers)
    # Create and write to shared memory
    shared_mem = SharedMemory(create=True, size=data_sz)
    write_offset = 0
    for data in data_ls:
        write_end = write_offset + len(data)
        shared_mem.buf[write_offset:write_end] = data  # type: ignore
        write_offset = write_end
    # Clean up
    shared_mem.close()
    return shared_mem.name, data_sz

def receiver(shared_mem_name, data_sz):
    # Read the shared memory
    shared_mem = SharedMemory(name=shared_mem_name)
    data = shared_mem.buf[:data_sz]
    # Unpack and un-pickle the data buffers
    buffers = pickle_utils.unpack_frames(data)
    obj = pickle.loads(buffers[0], buffers=buffers[1:])  # type: ignore
    # Bring the `obj` out of shared memory
    ret = copy.deepcopy(obj)
    # Clean up: drop all views into the segment before closing it
    del data
    del buffers
    del obj
    shared_mem.close()
    shared_mem.unlink()
    return ret

start_time = time.time()
# Our big python data object
big_array = np.arange(5 * 10**7)
shared_mem_name, data_sz = sender(big_array)
obj = receiver(shared_mem_name, data_sz)
print("--- Total %s seconds ---" % (time.time() - start_time))
print(obj)  # [ 0 1 2 ... 49999997 49999998 49999999]
Pipes:
from multiprocessing import Process, Pipe
import time
import numpy as np
def sender(send_conn):
    # Our big python data object
    big_array = np.arange(5 * 10**7)
    send_conn.send(big_array)
    send_conn.close()

def receiver(recv_conn):
    obj = recv_conn.recv()
    recv_conn.close()
    return obj

recv_conn, send_conn = Pipe(duplex=False)
start_time = time.time()
p = Process(target=sender, args=(send_conn,))
p.start()
obj = receiver(recv_conn)
p.join()
print("--- Total %s seconds ---" % (time.time() - start_time))
print(obj)  # [ 0 1 2 ... 49999997 49999998 49999999]
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 8
- Comments: 5
Top GitHub Comments
You are all in luck:
@DamianBarabonkovQC could you add the function pickle_utils.pack_frames so that I am able to reproduce the provided code? I cannot find a library pickle_utils which provides the function pack_frames.
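pickle_utils does not appear to be a published package, so the following is only a guess at what pack_frames / unpack_frames might look like, assuming a simple 8-byte length-prefixed framing consistent with how the demo uses them: pack returns a total size plus the chunks to write sequentially, and unpack walks a buffer of exactly that size:

```python
import struct

def pack_frames(frames):
    # Hypothetical reimplementation: prefix each frame with its
    # little-endian 8-byte length. Returns (total_size, chunks) so the
    # caller can size a SharedMemory segment and write sequentially.
    chunks = []
    total = 0
    for frame in frames:
        header = struct.pack("<Q", len(frame))
        chunks.append(header)
        chunks.append(frame)
        total += len(header) + len(frame)
    return total, chunks

def unpack_frames(data):
    # Walk the length-prefixed stream, returning zero-copy memoryviews.
    view = memoryview(data)
    frames = []
    offset = 0
    while offset < len(view):
        (n,) = struct.unpack_from("<Q", view, offset)
        offset += 8
        frames.append(view[offset:offset + n])
        offset += n
    return frames

# Round trip
size, chunks = pack_frames([b"header", b"payload"])
restored = [bytes(f) for f in unpack_frames(b"".join(chunks))]
print(size, restored)  # 29 [b'header', b'payload']
```

Because unpack_frames returns memoryviews into the shared buffer, the demo's deepcopy-then-del dance before shared_mem.close() is consistent with this framing.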