urllib3.exceptions.ProtocolError while reading many avro files from s3
I am using dask, and the recently added bag.read_avro functionality, to filter a bunch of Avro files in S3 and then read them into a pandas DataFrame. I am reading a large number of files (>20,000) with the following code on an Ubuntu EC2 instance:
import dask.bag

# Specify some constants
URLPATH = 's3://bucket/path/*.avro'

def filter_app(data):
    return data['payload']['application']['applicationId'] == 'app_name'

# Create bag
bag = dask.bag.read_avro(
    URLPATH,
    storage_options={
        # Avoid "connection pool is full" errors, as discussed here:
        # https://github.com/dask/dask/issues/3493
        'config_kwargs': {'max_pool_connections': 500}
    },
    # Lots of files (>20,000); setting this to None speeds things up considerably
    blocksize=None
)
bag = bag.filter(filter_app)
dd = bag.to_dataframe()

# Convert to pandas; use many workers since the job is IO-heavy
df = dd.compute(num_workers=100)
This works fine some of the time, but at other times it fails with a urllib3.exceptions.ProtocolError (testing on the exact same files each time). I get the following error in the df = dd.compute(num_workers=100) step:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 331, in _error_catcher
yield
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 409, in read
data = self._fp.read()
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/http/client.py", line 462, in read
s = self._safe_read(self.length)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/http/client.py", line 614, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(8427 bytes read, 8202 more expected)
During handling of the above exception, another exception occurred:
....
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 252, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 249, in _execute_task
return [_execute_task(a, cache) for a in arg]
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 249, in <listcomp>
return [_execute_task(a, cache) for a in arg]
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 252, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 252, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/local.py", line 253, in _execute_task
return func(*args2)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/dask/bag/avro.py", line 149, in read_file
return list(fastavro.iter_avro(f))
File "fastavro/_read.pyx", line 688, in fastavro._read.reader.__init__
File "fastavro/_read.pyx", line 654, in fastavro._read.file_reader.__init__
File "fastavro/_read.pyx", line 539, in fastavro._read._read_data
File "fastavro/_read.pyx", line 451, in fastavro._read.read_record
File "fastavro/_read.pyx", line 529, in fastavro._read._read_data
File "fastavro/_read.pyx", line 312, in fastavro._read.read_fixed
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/s3fs/core.py", line 1311, in read
self._fetch(self.loc, self.loc + length)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/s3fs/core.py", line 1275, in _fetch
req_kw=self.s3.req_kw)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/s3fs/core.py", line 1496, in _fetch_range
return resp['Body'].read()
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/botocore/response.py", line 78, in read
chunk = self._raw_stream.read(amt)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 430, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/ubuntu/miniconda3/envs/dask/lib/python3.6/site-packages/urllib3/response.py", line 349, in _error_catcher
raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(8427 bytes read, 8202 more expected)', IncompleteRead(8427 bytes read, 8202 more expected))
Versions
- Python 3.6
- s3fs 0.1.6
- dask 0.19.4
Any thoughts on what may be causing this? Is using 100 threads to read from S3 too much?
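Not part of the original report, but since the IncompleteRead appears to be transient (the same files succeed on other runs), one workaround is to treat ProtocolError as retryable and re-run the compute. A minimal sketch, with a hypothetical helper name:

```python
import time

import urllib3


def compute_with_retries(obj, attempts=3, delay=5, **kwargs):
    """Retry a dask .compute() when S3 drops the connection mid-read.

    ProtocolError / IncompleteRead are usually transient network hiccups,
    so a simple retry loop often succeeds on a later pass.
    """
    for attempt in range(attempts):
        try:
            return obj.compute(**kwargs)
        except urllib3.exceptions.ProtocolError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the original error
            time.sleep(delay * (attempt + 1))  # back off a little each time


# df = compute_with_retries(dd, num_workers=100)
```

This retries the whole graph, which is wasteful for a 20,000-file job; retrying at the task level (or inside s3fs, as discussed below in the comments) would be cheaper.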
Issue Analytics
- Created: 5 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Note that s3fs does retry internally for some specific errors, so we could consider adding this one to the list too.
This could be moved to s3fs.
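For illustration only (a hypothetical helper, not s3fs's actual retry code), the kind of read-level retry the comment refers to could look like this, retrying the underlying byte-range fetch when the response is truncated:

```python
import time
from http.client import IncompleteRead  # the stdlib error urllib3 wraps


def read_range_with_retry(fetch, start, end, retries=4, delay=1.0):
    """Call a byte-range fetch function, retrying on truncated reads.

    `fetch(start, end)` is assumed to return the bytes for [start, end).
    Retrying here, at the individual range request, avoids re-running
    the whole dask graph when one S3 response is cut short.
    """
    for attempt in range(retries):
        try:
            return fetch(start, end)
        except IncompleteRead:
            if attempt == retries - 1:
                raise
            time.sleep(delay * 2 ** attempt)  # exponential backoff
```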