
SSL error when loading many files from S3

See original GitHub issue

I’m trying out a little word-count example on a bunch of GZIP newline-delimited JSON files (144, to be exact) in a private S3 bucket, and I’ve been running into SSL errors like the one below. These errors are intermittent, unfortunately, and happen maybe 1/4 of the time.

Without even trying to do word counting, I get this just by counting lines:

import dask.bag as db

g = (db.read_text('s3://path/to/input/*/*.gz')
     .count())
g.compute()

I haven’t gotten too far with tracking it down, but it seems possibly related to issues with sharing SSL connections across threads (as discussed here). It may also be related to this issue with thread safety in requests and this more specific one in conda’s s3 channel support.

Any ideas? Maybe dask needs to use different boto clients per thread?

This very well might be an upstream issue with botocore, requests, etc.
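
For context, the usual thread-safety guidance for botocore/boto3 is to give each thread its own session or client rather than sharing one across threads. The sketch below only illustrates that general pattern; it is not dask’s or s3fs’s actual code, it assumes boto3 is installed, and get_s3_client/fetch_object are hypothetical helper names:

# Illustrative sketch of the "one boto client per thread" pattern; not dask code.
import threading

import boto3

_local = threading.local()

def get_s3_client():
    # botocore clients are not guaranteed to be safe to share across threads,
    # so each thread lazily creates and then reuses its own client.
    if not hasattr(_local, "s3"):
        _local.s3 = boto3.client("s3")
    return _local.s3

def fetch_object(bucket, key):
    # Hypothetical helper: each worker thread uses its own client (and its own
    # connection pool) instead of sharing one SSL connection.
    return get_s3_client().get_object(Bucket=bucket, Key=key)["Body"].read()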

Versions and such:

  • dask 0.10.0
  • python 3.4
  • botocore 1.4.26
  • ubuntu (on EC2)

Traceback:

---------------------------------------------------------------------------
SSLError                                  Traceback (most recent call last)
<ipython-input-39-db600568b11b> in <module>()
      1 g = (db.read_text('s3://path/to/input/*/*.gz')
      2      .count())
----> 3 g.compute()

/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/base.py in compute(self, **kwargs)
     84             Extra keywords to forward to the scheduler ``get`` function.
     85         """
---> 86         return compute(self, **kwargs)[0]
     87 
     88     @classmethod

/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/base.py in compute(*args, **kwargs)
    177         dsk = merge(var.dask for var in variables)
    178     keys = [var._keys() for var in variables]
--> 179     results = get(dsk, keys, **kwargs)
    180 
    181     results_iter = iter(results)

/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/multiprocessing.py in get(dsk, keys, num_workers, func_loads, func_dumps, optimize_graph, **kwargs)
     77         # Run
     78         result = get_async(apply_async, len(pool._pool), dsk3, keys,
---> 79                            queue=queue, get_id=_process_get_id, **kwargs)
     80     finally:
     81         if cleanup:

/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
    482                 _execute_task(task, data)  # Re-execute locally
    483             else:
--> 484                 raise(remote_exception(res, tb))
    485         state['cache'][key] = res
    486         finish_task(dsk, key, state, results, keyorder.get)

SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1748)

Traceback
---------
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 267, in execute_task
    result = _execute_task(task, data)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
    return [_execute_task(a, cache) for a in arg]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 245, in <listcomp>
    return [_execute_task(a, cache) for a in arg]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 248, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/dask/bytes/s3.py", line 103, in s3_open_file
    return s3.open(path, mode='rb')
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 212, in open
    return S3File(self, path, mode, block_size=block_size)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 680, in __init__
    self.size = self.info()['Size']
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 686, in info
    return self.s3.info(self.path)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 303, in info
    files = self._lsdir(parent, refresh=refresh)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/s3fs/core.py", line 226, in _lsdir
    for i in it:
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/paginate.py", line 102, in __iter__
    response = self._make_request(current_kwargs)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/paginate.py", line 174, in _make_request
    return self._method(**current_kwargs)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/client.py", line 262, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/client.py", line 541, in _make_api_call
    operation_model, request_dict)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 117, in make_request
    return self._send_request(request_dict, operation_model)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 146, in _send_request
    success_response, exception):
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 219, in _needs_retry
    caught_exception=caught_exception)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/hooks.py", line 227, in emit
    return self._emit(event_name, kwargs)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 266, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 314, in __call__
    caught_exception)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/retryhandler.py", line 356, in _check_caught_exception
    raise caught_exception
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/endpoint.py", line 174, in _get_response
    proxies=self.proxies, timeout=self.timeout)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/sessions.py", line 605, in send
    r.content
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/models.py", line 750, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/models.py", line 673, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 303, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 447, in read_chunked
    self._update_chunk_length()
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 394, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/socket.py", line 378, in readinto
    return self._sock.recv_into(b)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/ssl.py", line 748, in recv_into
    return self.read(nbytes, buffer)
  File "/mnt/mheilman/miniconda3/envs/3.4/lib/python3.4/ssl.py", line 620, in read
    v = self._sslobj.read(len, buffer)

Note: I originally filed this as an s3fs issue by accident (https://github.com/dask/s3fs/issues/55), but I think it maybe makes more sense here.

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, Jun 15, 2016

If I understood correctly, the issues you link would occur if the boto session were being shared between threads/processes. S3FileSystem is designed to pickle without the underlying boto connection object, creating a new one when restored, and I believe the dask code always calls the constructor rather than sharing instances. However, it occurs to me that using current() and the cached connections in ._conn could result in objects being copied if the S3FileSystem class (as opposed to its instances) were being shared.

Perhaps one way to test would be to print S3FileSystem._conn from each thread/process at a point where, so far, the only S3 actions have taken place in the client.
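
A rough sketch of that diagnostic (not from the original thread): run a trivial task on each worker and report which process/thread it ran in and what the S3FileSystem class currently caches. ._conn is the internal s3fs cache mentioned above and may differ between s3fs versions:

# Rough diagnostic sketch: report pid, thread id, and the class-level
# connection cache as seen by each worker.
import os
import threading

import dask.bag as db
from s3fs import S3FileSystem

def report(_):
    conn_cache = getattr(S3FileSystem, "_conn", {})  # internal attribute; may vary by version
    return os.getpid(), threading.get_ident(), list(conn_cache)

print(db.from_sequence(range(8), npartitions=8).map(report).compute())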

0 reactions
mrocklin commented, Jun 16, 2016

Very glad to see that this seems to fix the issue. Can I interest you in submitting a PR with your solution?

I don’t think that this is GIL related. This is happening at a layer that the GIL doesn’t strongly affect. There are likely system files (sockets) that are being shared between multiple processes without those processes being aware of the issue.
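
A minimal sketch of that failure mode outside of dask: if an S3 client (and its underlying SSL socket) is created before the worker processes fork, every worker inherits the same socket, and concurrent use can corrupt the TLS stream; creating the client inside each process avoids this. The bucket and key names below are placeholders, and this illustrates the general pattern rather than the fix applied here:

# Create the boto client per process (in a pool initializer), never before the fork.
import multiprocessing as mp

import boto3

_client = None

def init_worker():
    # Runs once in each worker process, after it starts: every process gets
    # its own client and therefore its own sockets.
    global _client
    _client = boto3.client("s3")

def head(key):
    # Placeholder bucket/key; returns the object size via a HEAD request.
    return _client.head_object(Bucket="my-bucket", Key=key)["ContentLength"]

if __name__ == "__main__":
    keys = ["data/part-%03d.json.gz" % i for i in range(16)]
    with mp.Pool(4, initializer=init_worker) as pool:
        print(pool.map(head, keys))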

