Exception in interaction between `pyarrow.filesystem.S3FSWrapper` and `s3fs.core.S3FileSystem`

What happened: Using s3fs with pyarrow within petastorm throws an exception (TypeError: 'coroutine' object is not iterable) in multiple places that use pyarrow.parquet.ParquetDataset (petastorm.reader.make_reader, petastorm.reader.make_batch_reader, etc.).

What you expected to happen: No exception

Minimal Complete Verifiable Example:

import pyarrow.parquet as pq
from s3fs import S3FileSystem
from pyarrow.filesystem import S3FSWrapper

fs = S3FSWrapper(S3FileSystem())
dataset_url = "s3://our-bucket/series/of/prefixes/partition1=foo/partition2=bar"

# This throws the relevant exception
dataset = pq.ParquetDataset(dataset_url, filesystem=fs)

Output:

./env/lib/python3.7/site-packages/pyarrow/filesystem.py:394: RuntimeWarning: coroutine 'S3FileSystem._ls' was never awaited
  for key in list(self.fs._ls(path, refresh=refresh)):
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 1348, in _make_manifest
    metadata_nthreads=metadata_nthreads)
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 927, in __init__
    self._visit_level(0, self.dirpath, [])
  File "./env/lib/python3.7/site-packages/pyarrow/parquet.py", line 942, in _visit_level
    _, directories, files = next(fs.walk(base_path))
  File "./env/lib/python3.7/site-packages/pyarrow/filesystem.py", line 394, in walk
    for key in list(self.fs._ls(path, refresh=refresh)):
TypeError: 'coroutine' object is not iterable
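For context, this failure mode can be reproduced without S3 at all: s3fs >= 0.5 rebased onto fsspec's async filesystem support, so internal methods such as `_ls` became coroutine functions, while the legacy `pyarrow.filesystem.S3FSWrapper.walk` still iterates their return value directly. A minimal stand-alone sketch of the mechanism (the `AsyncFS` class and its return values are hypothetical stand-ins, not the real s3fs code):

```python
class AsyncFS:
    # Stand-in for s3fs >= 0.5, where internal listing methods are coroutines.
    async def _ls(self, path, refresh=False):
        return ["partition1=foo/", "partition2=bar/"]

fs = AsyncFS()
coro = fs._ls("our-bucket/series/of/prefixes")

try:
    # Mirrors pyarrow/filesystem.py line 394: list(self.fs._ls(path, ...)).
    # Calling a coroutine function returns a coroutine object, which is
    # not iterable, hence the TypeError in the traceback above.
    keys = list(coro)
except TypeError as exc:
    print(exc)  # 'coroutine' object is not iterable

coro.close()  # suppress the "coroutine ... was never awaited" RuntimeWarning
```

The listing can only be obtained by awaiting the coroutine (or going through a sync-to-async bridge), which the synchronous `S3FSWrapper` does not do.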

Anything else we need to know?: I’m having trouble navigating the bug-reporting process for Apache Arrow, so I’d appreciate it if you’re able to pass this on to them.

Environment:

  • Dask version: 0.5.1
  • Python version: 3.7.8
  • Operating System: Mac OS 10.15.6
  • Install method (conda, pip, source): pip install s3fs==0.5.1

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Sep 14, 2020

You can turn on logging in s3fs to see what the calls are
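For reference, s3fs logs through the standard library's `logging` module under the `"s3fs"` logger name, so this can be enabled without touching s3fs itself (a sketch; the exact log content varies by s3fs version):

```python
import logging

# Configure a root handler so debug records are actually emitted somewhere.
logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")

# Raise the s3fs logger to DEBUG; subsequent s3fs calls (ls, open, etc.)
# will then log the S3 requests they make.
logging.getLogger("s3fs").setLevel(logging.DEBUG)
```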

On September 14, 2020 5:04:02 PM EDT, David McGuire notifications@github.com wrote:

I can’t sniff the AWS requests because they’re encrypted at the transport and application layers. However, as I mentioned in the other ticket (#365), it’s done doing meaningful work (as measured by the number of connections to S3 in the ESTABLISHED state) almost immediately, after perhaps 10-15 seconds:

Launching:

$ time python repro.py &
[1] 75641

Measuring (note: the Process ID above is for `time`, so I’m using the child PID):

$ lsof -p 75642 | grep s3 | wc -l
     11
$ lsof -p 75642 | grep s3 | wc -l
     11
$ lsof -p 75642 | grep s3 | wc -l
     11
$ lsof -p 75642 | grep s3 | wc -l
     10
$ lsof -p 75642 | grep s3 | wc -l
      1
$ lsof -p 75642 | grep s3 | wc -l
      0

I have to transcribe the PID, but it couldn’t take me longer than maybe 5s to do that, then I’m measuring for maybe 10s before the connections all disappear. This has all the classic hallmarks of deadlock, which is why I wrote up the other ticket that way.

– You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dask/s3fs/issues/366#issuecomment-692313196

– Sent from my Android device with K-9 Mail. Please excuse my brevity.
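The manual PID transcription described in the quoted measurement can be avoided by capturing the child PID with `$!`. A shell sketch, assuming the same `repro.py` script and `lsof` availability as above:

```shell
# Launch the reproduction in the background and capture its PID directly,
# instead of transcribing it from the `time` wrapper's job output.
python repro.py &
pid=$!

# Poll the count of open S3 connections every 2 seconds until the
# process exits (kill -0 only checks for existence, it sends no signal).
while kill -0 "$pid" 2>/dev/null; do
    lsof -p "$pid" | grep -c s3
    sleep 2
done
```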

0 reactions
dmcguire81 commented, Sep 17, 2020

I’ll follow up with petastorm to have them remove the wrapper, if pyarrow isn’t going to (or doesn’t need to) maintain it.
