SSL error when loading many files from S3
I’m trying out a little word count example on a bunch of (144, to be exact) gzipped, newline-delimited JSON files in a private S3 bucket, and I’ve been running into SSL errors like the one below. These errors are intermittent, unfortunately, and happen maybe 1/4 of the time.
Without even trying to do word counting, I get this just by counting lines:
import dask.bag as db

g = (db.read_text('s3://path/to/input/*/*.gz')
       .count())
g.compute()
I haven’t gotten too far with tracking it down, but it seems possibly related to issues with sharing SSL connections across threads (as discussed here). It may also be related to this issue with thread safety in requests and this more specific one in conda’s s3 channel support.
Any ideas? Maybe dask needs to use different boto clients per thread?
This very well might be an upstream issue with botocore, requests, etc.
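To make the "different clients per thread" idea concrete, here is a minimal sketch using `threading.local`. The names `get_client` and `factory` are hypothetical and are not part of dask, s3fs, or botocore; the sketch only shows the general pattern and is not a confirmed fix for the error above.

```python
import threading

# Thread-local storage: each thread sees its own attributes on this object.
_local = threading.local()


def get_client(factory):
    """Return the calling thread's client, creating it on first use.

    `factory` is any zero-argument callable that builds a new client
    (hypothetical; supplied by the caller).
    """
    client = getattr(_local, "client", None)
    if client is None:
        client = factory()
        _local.client = client
    return client
```

With boto3 installed, the factory could be something like `lambda: boto3.client("s3")`. Note that the traceback below goes through `dask/multiprocessing.py`, so sharing across processes may matter as much as sharing across threads.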
Versions and such:
- dask 0.10.0
- python 3.4
- botocore 1.4.26
- ubuntu (on EC2)
Traceback:
---------------------------------------------------------------------------
SSLError Traceback (most recent call last)
<ipython-input-39-db600568b11b> in <module>()
1 g = (db.read_text('s3://path/to/input/*/*.gz')
2 .count())
----> 3 g.compute()
/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/base.py in compute(self, **kwargs)
84 Extra keywords to forward to the scheduler ``get`` function.
85 """
---> 86 return compute(self, **kwargs)[0]
87
88 @classmethod
/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/base.py in compute(*args, **kwargs)
177 dsk = merge(var.dask for var in variables)
178 keys = [var._keys() for var in variables]
--> 179 results = get(dsk, keys, **kwargs)
180
181 results_iter = iter(results)
/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/multiprocessing.py in get(dsk, keys, num_workers, func_loads, func_dumps, optimize_graph, **kwargs)
77 # Run
78 result = get_async(apply_async, len(pool._pool), dsk3, keys,
---> 79 queue=queue, get_id=_process_get_id, **kwargs)
80 finally:
81 if cleanup:
/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
482 _execute_task(task, data) # Re-execute locally
483 else:
--> 484 raise(remote_exception(res, tb))
485 state['cache'][key] = res
486 finish_task(dsk, key, state, results, keyorder.get)
SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1748)
Traceback
---------
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 267, in execute_task
result = _execute_task(task, data)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return [_execute_task(a, cache) for a in arg]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 245, in <listcomp>
return [_execute_task(a, cache) for a in arg]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/bytes/s3.py", line 103, in s3_open_file
return s3.open(path, mode='rb')
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 212, in open
return S3File(self, path, mode, block_size=block_size)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 680, in __init__
self.size = self.info()['Size']
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 686, in info
return self.s3.info(self.path)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 303, in info
files = self._lsdir(parent, refresh=refresh)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 226, in _lsdir
for i in it:
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/paginate.py", line 102, in __iter__
response = self._make_request(current_kwargs)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/paginate.py", line 174, in _make_request
return self._method(**current_kwargs)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/client.py", line 262, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/client.py", line 541, in _make_api_call
operation_model, request_dict)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 117, in make_request
return self._send_request(request_dict, operation_model)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 146, in _send_request
success_response, exception):
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 219, in _needs_retry
caught_exception=caught_exception)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 251, in __call__
caught_exception)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 266, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 314, in __call__
caught_exception)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 223, in __call__
attempt_number, caught_exception)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 356, in _check_caught_exception
raise caught_exception
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 174, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/sessions.py", line 605, in send
r.content
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/models.py", line 750, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/models.py", line 673, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 303, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 447, in read_chunked
self._update_chunk_length()
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 394, in _update_chunk_length
line = self._fp.fp.readline()
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/socket.py", line 378, in readinto
return self._sock.recv_into(b)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/ssl.py", line 748, in recv_into
return self.read(nbytes, buffer)
File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/ssl.py", line 620, in read
v = self._sslobj.read(len, buffer)
Note: I originally filed this as an s3fs issue by accident (https://github.com/dask/s3fs/issues/55), but I think it maybe makes more sense here.
Top GitHub Comments
If I understood correctly, the issues you link would occur if the boto session was being shared between threads/processes. S3FileSystem is designed to pickle without the object, and create a new one when restored, and I believe the dask code always calls the constructor rather than sharing instances. However, it occurs to me that using current() and the cached connections in ._conn could result in copying of objects if the S3FileSystem class (as opposed to instances) were being shared.
Perhaps one way to test would be to print S3FileSystem._conn from each thread/process when so far the only S3 actions have taken place in the client.
Very glad to see that this seems to fix the issue. Can I interest you in submitting a PR with your solution?
I don’t think that this is GIL related. This is happening at a layer that the GIL doesn’t strongly affect. There are likely system files (sockets) that are being shared between multiple processes without those processes being aware of the issue.