botocore.exceptions.CredentialRetrievalError raised when opening many files in parallel from S3
Problem description
Be sure your description clearly answers the following questions:

- What are you trying to achieve? I'm trying to read several files in parallel from S3 using `multiprocessing`. I'm using a single `c5.24xlarge` or `m5a.24xlarge` EC2 instance which is running a single container. Note that each process is reading a different file.
- What is the expected result? The opens should be successful. `smart_open` should be smarter at supporting parallelism.
- What are you seeing instead? The following exception is raised:

botocore.exceptions.CredentialRetrievalError: Error when retrieving credentials from container-role: Error retrieving metadata: Received non 200 response (429) from ECS metadata: You have reached maximum request limit.

Note that setting the environment variable `ECS_TASK_METADATA_RPS_LIMIT="8000,9000"` didn't help at all. What helped is a retry after catching the exception and sleeping with `time.sleep(random.random())`, but there has got to be a cleaner way.
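The retry-and-sleep workaround described above can be sketched as a small helper. This is not part of smart_open; the function name, the `retry_on` parameter, and the attempt count are illustrative assumptions:

```python
import random
import time


def open_with_retry(open_fn, *args, retry_on=Exception, attempts=5,
                    max_jitter=1.0, **kwargs):
    """Call open_fn(*args, **kwargs), retrying after a random sleep
    (uniform in [0, max_jitter)) when it raises retry_on.

    In the real application, retry_on would be
    botocore.exceptions.CredentialRetrievalError.
    """
    for attempt in range(attempts):
        try:
            return open_fn(*args, **kwargs)
        except retry_on:
            if attempt == attempts - 1:
                raise  # exhausted all attempts; propagate the error
            time.sleep(random.random() * max_jitter)
```

In the application this could be invoked as `open_with_retry(smart_open.open, input_s3_object_uri, "rb", retry_on=botocore.exceptions.CredentialRetrievalError)`.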
Steps/code to reproduce the problem
This line of code, when run at about the same time from 168 simultaneous processes in a Docker container, raises the aforementioned exception:

with smart_open.open(input_s3_object_uri, "rb") as input_file:
Versions
Please provide the output of:
import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
>>> import platform, sys, smart_open
>>> print(platform.platform())
Linux-4.19.76-linuxkit-x86_64-with-glibc2.10
>>> print("Python", sys.version)
Python 3.8.3 (default, May 19 2020, 18:47:26)
[GCC 7.3.0]
>>> print("smart_open", smart_open.__version__)
smart_open 2.0.0
Checklist
Before you create the issue, please make sure you have:
- Described the problem clearly
- Provided a minimal reproducible example, including any required data
- Provided the version numbers of the relevant software
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5
Is it not supposed to be smart? Or is it just semi-smart?
A pure exponential backoff doesn't help with a uniform load distribution.
time.sleep(random.random() * exp_backoff_multiplier)
does, although this still leaves open the appropriate choice of backoff parameters. The suggested strategy should work in roughly 90% of cases. On AWS the number of cores per instance is limited to 128, and it is doubtful that anyone would run more than 512 worker processes on such a node. Most users who would otherwise write a naive retry would be better off with this strategy.
Having “smart” in the name doesn’t mean “it does everything for you”. For example, equipping yourself with a smartphone, smartwatch, etc. doesn’t instantly make you a genius (quite often the opposite). You still need to apply some of your own effort.
Yeah, the devil is always in the details. What parameters to pass? How to pass them? How to enable/disable this functionality?
I think the best way forward in your use case is to handle the above questions in your application logic.