Deadlock in the interaction between `pyarrow.filesystem.S3FSWrapper` and `s3fs.core.S3FileSystem`
What happened: Some interaction between s3fs, pyarrow, and petastorm causes a deadlock.
What you expected to happen: s3fs to be thread-safe, given that pyarrow uses it that way.
Minimal Complete Verifiable Example:
import pyarrow.parquet as pq
from petastorm.fs_utils import get_filesystem_and_path_or_paths, normalize_dir_url
dataset_url = 's3://<redacted>'
# Repeat basic steps that make_reader or make_batch_reader normally does
dataset_url = normalize_dir_url(dataset_url)
fs, path = get_filesystem_and_path_or_paths(dataset_url)
# Finished in seconds
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=1)
# Hung all night
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=10)
# Their code (pyarrow's wrapper)
>>> type(fs)
<class 'pyarrow.filesystem.S3FSWrapper'>
# Your code (the underlying s3fs filesystem)
>>> type(fs.fs)
<class 's3fs.core.S3FileSystem'>
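For comparison, here is a hedged sketch of the same code path without petastorm, built directly on s3fs and pyarrow's legacy filesystem wrapper. The bucket path is a placeholder, and constructing S3FSWrapper by hand is assumed to be equivalent to what get_filesystem_and_path_or_paths returns:

# Minimal sketch, assuming direct wrapping mirrors petastorm's behavior.
import s3fs
import pyarrow.filesystem
import pyarrow.parquet as pq

s3 = s3fs.S3FileSystem()
fs = pyarrow.filesystem.S3FSWrapper(s3)  # same wrapper type as shown above
path = 'my-bucket/path/to/dataset'       # hypothetical; substitute a real dataset

# Single-threaded metadata collection completes in seconds...
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=1)
# ...while multithreaded collection is where the hang was observed.
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=10)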
Anything else we need to know?:
If your code is not thread-safe, that would appear to be news to pyarrow. Also reported to Petastorm; will be reported to PyArrow.
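To make the thread-safety expectation concrete, the sketch below illustrates the access pattern that metadata_nthreads > 1 implies: several worker threads sharing one s3fs.S3FileSystem instance. This is only an illustration of the concurrency shape, not pyarrow's actual metadata-collection code, and the object keys are hypothetical:

# Hedged illustration of concurrent access to a single shared filesystem object.
from concurrent.futures import ThreadPoolExecutor
import s3fs

s3 = s3fs.S3FileSystem()

def read_head(key):
    # Every worker calls into the same shared filesystem instance.
    with s3.open(key, 'rb') as f:
        return f.read(8)

keys = ['my-bucket/dataset/part-%05d.parquet' % i for i in range(10)]  # hypothetical keys
with ThreadPoolExecutor(max_workers=10) as pool:
    heads = list(pool.map(read_head, keys))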
Environment:
- Dask version: 0.4.2
- Python version: 3.7.8
- Operating System: Mac OS 10.15.6
- Install method (conda, pip, source): pip install s3fs==0.4.2
Reported as ARROW-10029.
Cool, thanks for further looking into it and figuring it out, @dmcguire81!