
urllib3.exceptions.ProtocolError while reading many avro files from s3


I am using dask, and the recently added bag.read_avro functionality, to filter a bunch of avro files in s3, then read them into a pandas dataframe.

I am reading a large number of files (>20000) with the following code on an ubuntu EC2 instance:

import dask.bag

# Specify some constants
URLPATH = 's3://bucket/path/*.avro'

def filter_app(data):
    return data['payload']['application']['applicationId'] == 'app_name'

# Create bag
bag = dask.bag.read_avro(
    URLPATH,
    storage_options={
        # To avoid "connection pool is full" errors, as discussed here:
        # https://github.com/dask/dask/issues/3493
        'config_kwargs': {'max_pool_connections': 500}
    },
    blocksize=None  # lots of files (>20,000); setting this to None speeds things up considerably
)


bag = bag.filter(filter_app)
dd = bag.to_dataframe()

# convert to pandas
df = dd.compute(num_workers=100) # to speed things up since it is heavy IO

This works some of the time, but other times fails with a urllib3.exceptions.ProtocolError (testing on the exact same files each time).

The df = dd.compute(num_workers=100) step raises the following error:

Traceback

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 331, in _error_catcher
    yield
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 409, in read
    data = self._fp.read()
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/http/client.py", line 462, in read
    s = self._safe_read(self.length)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/http/client.py", line 614, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(8427 bytes read, 8202 more expected)

During handling of the above exception, another exception occurred:
....

  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 252, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 249, in _execute_task
    return [_execute_task(a, cache) for a in arg]
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 249, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 252, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 252, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 253, in _execute_task
    return func(*args2)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/bag/avro.py", line 149, in read_file
    return list(fastavro.iter_avro(f))
  File "fastavro/_read.pyx", line 688, in fastavro._read.reader.__init__
  File "fastavro/_read.pyx", line 654, in fastavro._read.file_reader.__init__
  File "fastavro/_read.pyx", line 539, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 451, in fastavro._read.read_record
  File "fastavro/_read.pyx", line 529, in fastavro._read._read_data
  File "fastavro/_read.pyx", line 312, in fastavro._read.read_fixed
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/s3fs/core.py", line 1311, in read
    self._fetch(self.loc, self.loc + length)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/s3fs/core.py", line 1275, in _fetch
    req_kw=self.s3.req_kw)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/s3fs/core.py", line 1496, in _fetch_range
    return resp['Body'].read()
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/botocore/response.py", line 78, in read
    chunk = self._raw_stream.read(amt)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 430, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 349, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(8427 bytes read, 8202 more expected)', IncompleteRead(8427 bytes read, 8202 more expected))

Versions

  • Python 3.6
  • s3fs 0.1.6
  • dask 0.19.4

Any thoughts on what may be causing this? Is using 100 threads to read from S3 too much?
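Since the failure is intermittent, one pragmatic user-level workaround (not something from this thread, just a hedged sketch) is to retry the whole compute when a ProtocolError surfaces. The function name, retry count, and backoff below are illustrative choices, not part of any dask API:

```python
import time

try:
    from urllib3.exceptions import ProtocolError
except ImportError:  # fallback so the sketch runs even without urllib3 installed
    class ProtocolError(Exception):
        pass


def compute_with_retries(graph, retries=3, backoff=5, **kwargs):
    """Retry a dask .compute() call on intermittent ProtocolError.

    `retries` and `backoff` are arbitrary illustrative values.
    """
    for attempt in range(retries):
        try:
            return graph.compute(**kwargs)
        except ProtocolError:
            if attempt == retries - 1:
                raise  # out of attempts; surface the original error
            time.sleep(backoff * (attempt + 1))  # simple linear backoff


# df = compute_with_retries(dd, num_workers=100)
```

This retries the entire graph, so it is wasteful for large jobs; retrying only the failed S3 reads (as discussed in the comments below) would be the better fix.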

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

martindurant commented, Oct 19, 2018 (1 reaction)

Note that s3fs does retry internally for some specific errors (below), so we could consider adding this to the list too.

try:
    from boto3.s3.transfer import S3_RETRYABLE_ERRORS
except ImportError:
    S3_RETRYABLE_ERRORS = (
        socket.timeout,
    )
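The change the comment hints at would, roughly, mean adding the urllib3 error to that tuple. A hypothetical sketch of what the extended list might look like (this is not the actual s3fs patch, and rebinding the tuple here would not by itself change s3fs's behavior; the real fix would live inside s3fs):

```python
import socket

try:
    from boto3.s3.transfer import S3_RETRYABLE_ERRORS
except ImportError:  # same fallback as the s3fs snippet above
    S3_RETRYABLE_ERRORS = (socket.timeout,)

try:
    from urllib3.exceptions import ProtocolError
except ImportError:  # so this sketch runs even without urllib3 installed
    class ProtocolError(Exception):
        pass

# Hypothetical: extend the retryable-errors tuple to cover the error in this issue
S3_RETRYABLE_ERRORS = S3_RETRYABLE_ERRORS + (ProtocolError,)
```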
martindurant commented, Apr 30, 2019 (0 reactions)

Could be moved to s3fs
