Unable to use copy_objects on Glue: InvalidRetryConfigurationError
Describe the bug
I have a Glue job that copies some files from a folder to another:
wr.s3.copy_objects(filtered_files, s3_src_dir, s3_dst_dir)
The job runs perfectly locally. I can confirm that by checking the folder on s3: the files are where they should be. However, when I deploy this job and run it on Glue, I get the following exception:
File "/glue/lib/installation/okra_datalake/scraping.py", line 235, in copy_files
wr.s3.copy_objects(filtered_files, s3_src_dir, s3_dst_dir)
File "/glue/lib/installation/awswrangler/s3/_copy.py", line 187, in copy_objects
_copy_objects(batch=batch, use_threads=use_threads, boto3_session=session)
File "/glue/lib/installation/awswrangler/s3/_copy.py", line 19, in _copy_objects
resource_s3: boto3.resource = _utils.resource(service_name="s3", session=boto3_session)
File "/glue/lib/installation/awswrangler/_utils.py", line 78, in resource
retries=
{
"max_attempts": 10,
"mode": "adaptive"
}
, connect_timeout=10, max_pool_connections=30
File "/usr/local/lib/python3.6/site-packages/botocore/config.py", line 158, in __init__
self._validate_retry_configuration(self.retries)
File "/usr/local/lib/python3.6/site-packages/botocore/config.py", line 205, in _validate_retry_configuration
retry_config_option=key)
botocore.exceptions.InvalidRetryConfigurationError: Cannot provide retry configuration for "mode". Valid retry configuration options are: 'max_attempts'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
TypeError: __init__() takes 1 positional argument but 2 were given
This seems to point back to the retry configuration set in awswrangler’s _utils.py (line 78 in the traceback above).
It seems that the botocore version that runs on Glue doesn’t support this retry configuration, while the one that I run locally does. I naively thought I was running the same dependencies locally and on Glue, but apparently I was wrong.
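For what it’s worth, here is a minimal sketch of the configuration that triggers the error (if I understand the botocore changelog correctly, the retry "mode" option only exists since botocore 1.15, so older versions reject it):

from botocore.config import Config

# Accepted on botocore >= 1.15; older versions only know "max_attempts"
# and raise InvalidRetryConfigurationError for the "mode" key.
config = Config(
    retries={"max_attempts": 10, "mode": "adaptive"},
    connect_timeout=10,
    max_pool_connections=30,
)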
Without going into too much detail: when I deploy a job to Glue, the dependencies I use in my virtualenv are packaged into a zip file and uploaded to an s3 bucket. When a deployed job runs on Glue, it’s supposed to use the dependencies coming from this zip file. It works (more or less), since I can access libraries like aws-data-wrangler and others, but it seems the botocore version that then runs on Glue comes from AWS, not from my zip file.
Apparently I’m not the only one to have this issue: https://github.com/boto/boto3/issues/2566
All this led me to believe that I can’t use wr.s3.copy_objects from aws-wrangler 1.9.3 on Glue at the moment. I haven’t tested other functions, but since it’s the _utils.resource function that is affected, I assume other functions wouldn’t work either.
Versions I run locally:
- boto-2.49.0
- botocore-1.17.44
- awswrangler-1.9.3
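A quick way to confirm which versions are actually loaded at runtime is to log them from inside the job itself (a small diagnostic sketch):

import boto3
import botocore
import awswrangler as wr

# Print the versions the Glue job actually imports; if botocore is
# older than 1.15, the "adaptive" retry mode is not available.
print("boto3:", boto3.__version__)
print("botocore:", botocore.__version__)
print("awswrangler:", wr.__version__)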
To Reproduce
I haven’t tried to isolate this issue into a minimal running example, but I imagine a simple job deployed to Glue with a call to wr.s3.copy_objects would trigger the error. aws-wrangler would need to be attached to the Glue job (and potentially all dependencies needed by aws-wrangler 😕).
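Something like this minimal script (bucket names and prefixes are hypothetical) should reproduce it once deployed to Glue:

import awswrangler as wr

# Hypothetical locations; any readable/writable bucket should do.
src = "s3://my-bucket/source/"
dst = "s3://my-bucket/target/"
files = wr.s3.list_objects(src)

# Fails on Glue when the bundled botocore predates the retry "mode" option.
wr.s3.copy_objects(paths=files, source_path=src, target_path=dst)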
Please let me know if you have an idea on how to solve that. I’d also be happy to provide more information for debugging.
Hi,
Thanks for your help. I have written some steps for people to follow to get the latest AWS CLI and boto3 APIs working; they just combine reading from multiple sources.
Note that Glue has an awscli dependency as well, along with boto3.
AWS Glue Python Shell with Internet
- Add the awscli and boto3 whl files to the Python library path during Glue job execution. This option is slow, as it has to download and install the dependencies.
- Upload the whl files to an s3 bucket under your given Python library path.
- Add the s3 whl file paths to the Python library path, giving the full s3 path of each whl file, separated by commas.
- Add the following code snippet to load the new files.
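A sketch of that snippet, based on the pattern from the AWS forums thread linked further down (GLUE_INSTALLATION is an environment variable available in Glue Python Shell jobs):

import os
import site
from importlib import reload
from setuptools.command import easy_install

# Install the newer wheels into Glue's installation directory and
# reload site so the freshly installed packages become importable.
install_path = os.environ["GLUE_INSTALLATION"]
easy_install.main(["--install-dir", install_path, "awscli"])
easy_install.main(["--install-dir", install_path, "boto3"])
reload(site)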
AWS Glue Python Shell without Internet connectivity
- Reference: AWS Wrangler Glue dependency build
- We followed the steps mentioned above for the awscli and boto3 whl files.
- Below is the latest requirements.txt compiled for the newest versions.
- Upload the boto3-depends.zip to s3 and add its path to the Glue job’s Referenced files path. Note: it is the Referenced files path, not the Python library path.
- Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python Shell, with additional code as per the thread below:
https://forums.aws.amazon.com/thread.jspa?messageID=954344
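A rough sketch of that placeholder code, assuming the Referenced files path drops boto3-depends.zip into the job’s working directory and that pip is available in the shell environment:

import os
import site
import subprocess
import sys
import zipfile
from importlib import reload

# Unpack the dependency bundle delivered via the Referenced files path
# and install offline from the bundled wheel files.
with zipfile.ZipFile("boto3-depends.zip") as zf:
    zf.extractall("boto3-depends")

install_path = os.environ["GLUE_INSTALLATION"]
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--no-index", "--find-links=boto3-depends",
    "--target", install_path,
    "awscli", "boto3",
])
reload(site)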
Thanks, Sarath
Thanks for sharing your strategy, it’s pretty cool.
The best part for me is avoiding internet access; it can also be useful in situations like this.
Yes, I think boto/botocore is probably tied to some internal dependencies, like the AWS CLI and the script that uploads your code from s3 into the job filesystem before it starts.