GCSFileSystem() hangs when called from multiple processes

What happened: In the last two versions of gcsfs (versions 2021.04.0 and 0.8.0), calling gcsfs.GCSFileSystem() from multiple processes hangs without any error messages if gcsfs.GCSFileSystem() has been called previously in the same Python interpreter session.

This bug was not present in gcsfs version 0.7.2 (with fsspec 0.8.7). All the code examples below work perfectly with gcsfs version 0.7.2 (with fsspec 0.8.7).

Minimal Complete Verifiable Example:

The examples below assume gcsfs 2021.04.0 (with fsspec 2021.04.0) or gcsfs 0.8.0 (with fsspec 0.9.0) is installed.

Create a fresh conda environment: conda create --name test_gcsfs python=3.8 gcsfs ipykernel

The last block of this code hangs:

from concurrent.futures import ProcessPoolExecutor
import gcsfs

# This line works fine!  (And it's fine to repeat this line multiple times.)
gcs = gcsfs.GCSFileSystem() 

# This block hangs, with no error messages:
with ProcessPoolExecutor() as executor:
    for i in range(8):
        future = executor.submit(gcsfs.GCSFileSystem)

But if we don’t call gcs = gcsfs.GCSFileSystem() first, the code works fine. The next example runs perfectly in a fresh Python interpreter; the only difference from the previous example is that gcs = gcsfs.GCSFileSystem() has been removed.

from concurrent.futures import ProcessPoolExecutor
import gcsfs

# This works fine:
with ProcessPoolExecutor() as executor:
    for i in range(8):
        future = executor.submit(gcsfs.GCSFileSystem)

Likewise, running the ProcessPoolExecutor block multiple times works the first time but hangs on subsequent attempts:

from concurrent.futures import ProcessPoolExecutor
import gcsfs

def process_pool():
    with ProcessPoolExecutor(max_workers=1) as executor:
        for i in range(8):
            future = executor.submit(gcsfs.GCSFileSystem)

# The first attempt works fine:
process_pool()

# This second attempt hangs:
process_pool()

Anything else we should know:

Thank you so much for all your hard work on gcsfs - it’s a hugely useful tool! Sorry to be reporting a bug!

I tested all this code in a Jupyter Lab notebook.

This issue might be related to this Stack Overflow question: https://stackoverflow.com/questions/66283634/use-gcsfilesystem-with-multiprocessing

Environment:

  • Dask version: Not installed
  • Python version: 3.8
  • Operating System: Ubuntu 20.10
  • Install method: conda, from conda-forge, using a fresh conda environment.

Top GitHub Comments

martindurant commented, May 12, 2021 (6 reactions)

Maybe call gcs.clear_instance_cache() before the block instead of at the end, or include skip_instance_cache=True in the constructor; but this still doesn’t clear the reference to the loop and thread. You could do that with:

import fsspec.asyn

fsspec.asyn.iothread[0] = None
fsspec.asyn.loop[0] = None

and that is what any fork-detecting code should be doing.

JackKelly commented, May 13, 2021 (5 reactions)

I’ve done a few more experiments, in the hope that this might be useful to other people in a similar situation, or help work out what’s going on!

It turns out that fsspec.asyn.iothread[0] = None; fsspec.asyn.loop[0] = None needs to be run in every worker process. It’s not sufficient to just do this in the parent process.

It doesn’t matter if the code does fsspec.asyn.iothread[0] = None; fsspec.asyn.loop[0] = None before or after gcs = gcsfs.GCSFileSystem().

When using fsspec.asyn.iothread[0] = None; fsspec.asyn.loop[0] = None, it’s no longer necessary to do skip_instance_cache=True or gcs.clear_instance_cache().

Each worker process has to open the Zarr store. If I try lazily opening the Zarr store in the main process and passing that object into each worker process, fsspec throws an error saying it’s not thread-safe. That’s fine; it’s no problem for my code to open the Zarr store in each worker process.
