Python multiprocessing writes get deadlocked on Linux systems

Hi everyone,

Our team is trying to write to zarr arrays on AWS S3 using Python’s built-in multiprocessing (mp) tools.

Once we start an mp.Pool() context and try to write to a zarr.Array, our tool deadlocks in certain environments.

Code sample and details below.

Code sample

Assuming S3 credentials are set up in your environment.

from itertools import repeat
import multiprocessing as mp

import numpy as np
import zarr


def worker(zarr_array, idx):
    zarr_array[idx, ...] = np.random.randn(1024, 1024).astype('float32')


if __name__ == "__main__":
    zarr_root_path = 's3://XXXXX/benchmarks/debug/test.zarr'  # Sorry, have to blank out the S3 bucket.

    root = zarr.open_group(zarr_root_path, mode='w')

    dataset = root.create_dataset(
        shape=(5, 1024, 1024),
        chunks=(1, -1, -1),
        name='test1',
        overwrite=True,
        synchronizer=zarr.ProcessSynchronizer('.lock'),
    )

    # dataset[:] = np.random.randn(*dataset.shape)

    iterable = zip(
        repeat(dataset),
        range(5),
    )

    with mp.Pool() as pool:
        pool.starmap(worker, iterable)

Problem description

We ran into this issue when working with AWS EC2 instances and our on-prem Linux workstations. My Windows laptop (x86_64) to S3 works as expected without any issues. The data is chunked by rows, and as you can see, our parallel writes are aligned with chunk boundaries. We are also using a zarr.ProcessSynchronizer() just in case. However, behavior is the same without one.
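
To make the chunk-alignment claim concrete, here is a minimal sketch in plain Python (no zarr or S3 involved). The helper names `chunks_touched` and `writers_are_disjoint` are hypothetical and only for illustration; they check that no two writers cover the same chunk along the first axis, which is the situation described for the (5, 1024, 1024) array with one row per chunk.

```python
def chunks_touched(row_start, row_stop, chunk_rows):
    """Chunk indices along axis 0 covered by rows [row_start, row_stop)."""
    first = row_start // chunk_rows
    last = (row_stop - 1) // chunk_rows
    return list(range(first, last + 1))


def writers_are_disjoint(row_ranges, chunk_rows):
    """True if no chunk along axis 0 is written by more than one writer."""
    seen = set()
    for start, stop in row_ranges:
        for chunk in chunks_touched(start, stop, chunk_rows):
            if chunk in seen:
                return False
            seen.add(chunk)
    return True


if __name__ == "__main__":
    # Layout from the code sample: 5 rows, 1 row per chunk, one row per worker.
    per_row_writes = [(i, i + 1) for i in range(5)]
    print(writers_are_disjoint(per_row_writes, chunk_rows=1))

    # Counter-example: with 2 rows per chunk, these two writers share chunk 1.
    print(writers_are_disjoint([(0, 3), (2, 5)], chunk_rows=2))
```

Since the writes in the sample are disjoint in this sense, overlapping chunk access is unlikely to be what causes the hang, which matches the observation that the synchronizer makes no difference.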

On a Windows to S3 environment, this code completes as expected. However, on the Linux tests, it doesn’t complete.

  • Does not create any files inside .zarray on S3
  • When we kill the process, it appears to be stuck waiting inside threading (see the traceback below)

If you comment out the iterable, the mp.Pool block, and the code associated with them, and uncomment the dataset[:] = np.random.randn(*dataset.shape) line, everything runs as expected on all systems. The issue only appears with multiprocessing.

Things we have tried:

  • Older s3fs and fsspec versions
  • Different Python versions
  • Add/remove ProcessSynchronizer and ThreadSynchronizer
  • Tried blosc.use_threads = False, and turning off compression
  • More things I can’t remember…

Here is the traceback after we keyboard interrupt.

Traceback (most recent call last):
  File "/home/ubuntu/develop/scratch/parallel_zarr.py", line 33, in <module>
    pool.starmap(worker, iterable)
  File "/home/ubuntu/miniconda3/envs/my_env/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/ubuntu/miniconda3/envs/my_env/lib/python3.9/multiprocessing/pool.py", line 765, in get
    self.wait(timeout)
  File "/home/ubuntu/miniconda3/envs/my_env/lib/python3.9/multiprocessing/pool.py", line 762, in wait
    self._event.wait(timeout)
  File "/home/ubuntu/miniconda3/envs/my_env/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "/home/ubuntu/miniconda3/envs/my_env/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt
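
The traceback shows the parent process blocked in get() with no timeout, waiting for results that never arrive. One way to turn such a silent hang into a visible failure while debugging is to use the asynchronous variant and pass a timeout. This is only a sketch and not something proposed in the issue: `slow_worker` and `starmap_with_timeout` are hypothetical names, and the sleep stands in for a write that never returns.

```python
import multiprocessing as mp
import time


def slow_worker(seconds):
    # Stand-in for a task that may hang; here it just sleeps.
    time.sleep(seconds)
    return seconds


def starmap_with_timeout(pool, func, iterable, timeout):
    """Run starmap, but return None instead of blocking past `timeout` seconds."""
    async_result = pool.starmap_async(func, iterable)
    try:
        return async_result.get(timeout=timeout)
    except mp.TimeoutError:
        # The tasks did not finish in time; the caller decides what to do next.
        return None


if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        # Finishes well within the timeout.
        print(starmap_with_timeout(pool, slow_worker, [(0.1,), (0.1,)], timeout=30))
        # Exceeds the timeout, so None is returned instead of waiting forever.
        print(starmap_with_timeout(pool, slow_worker, [(5,)], timeout=0.5))
```

A timeout does not fix a deadlock, but it lets the script exit and log which batch of tasks failed to complete.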

Version and installation information

  • AWS c6gd.12xlarge instance.
  • ARM 48 core CPU (aarch64); however on-prem boxes are x86_64.
  • Python 3.9.5 (also tested 3.7 and 3.8)
  • conda environment with conda version 4.10.1. Everything is installed via conda except 1 specialized package we built from source.
  • zarr version 2.8.1
  • s3fs version 0.6.0
  • fsspec version 2021.05.0
  • numcodecs version 0.7.3

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 17 (17 by maintainers)

Top GitHub Comments

martindurant commented, Jun 3, 2021 (3 reactions)

Have you tried multiprocessing context -> “spawn”? Does this work with dask-distributed and processes (which has a different serialisation model) instead of the pool?
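
As a rough illustration of the “spawn” suggestion, the sketch below requests a start method explicitly through multiprocessing.get_context instead of relying on the platform default. It is deliberately free of zarr and S3: `worker` is a stand-in for the write in the original sample, and `run_with_start_method` is a hypothetical helper. With "fork", children inherit a copy of the parent's memory, including any locks held at that moment; with "spawn", each child starts a fresh interpreter. Whether that difference is what causes the hang here is an assumption, not something established in the text above.

```python
import multiprocessing as mp


def worker(idx):
    # Stand-in for the zarr write in the original sample; it only returns a value.
    return idx * idx


def run_with_start_method(method="spawn", n_tasks=5):
    # get_context returns an object with the same API as the multiprocessing
    # module, bound to the requested start method.
    ctx = mp.get_context(method)
    with ctx.Pool(processes=2) as pool:
        return pool.map(worker, range(n_tasks))


if __name__ == "__main__":
    # The __main__ guard matters with "spawn": children re-import this module,
    # and the guard keeps them from starting pools of their own.
    print(run_with_start_method("spawn"))
```

In the original sample, the equivalent change would be replacing `mp.Pool()` with `mp.get_context("spawn").Pool()`. Under "spawn" the arguments sent to workers must be picklable, so the array object passed to each task needs to survive pickling.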

martindurant commented, Jun 3, 2021 (2 reactions)

> Is it more beneficial to use “forkserver” over “spawn” if performance is critical in a Unix environment?

For long-lived processes, it won’t matter: this is a one-time cost of starting the process (and “spawn” happens to be the only option on Windows). Dask uses forkserver on Linux by default, which ought to be OK, but
