Unable to use copy_objects on Glue: InvalidRetryConfigurationError
Describe the bug
I have a Glue job that copies some files from a folder to another:
wr.s3.copy_objects(filtered_files, s3_src_dir, s3_dst_dir)
The job runs perfectly locally. I can confirm that by checking the folder on s3: the files are where they should be. However, when I deploy this job and run it on Glue, I get the following exception:
File "/glue/lib/installation/okra_datalake/scraping.py", line 235, in copy_files
wr.s3.copy_objects(filtered_files, s3_src_dir, s3_dst_dir)
File "/glue/lib/installation/awswrangler/s3/_copy.py", line 187, in copy_objects
_copy_objects(batch=batch, use_threads=use_threads, boto3_session=session)
File "/glue/lib/installation/awswrangler/s3/_copy.py", line 19, in _copy_objects
resource_s3: boto3.resource = _utils.resource(service_name="s3", session=boto3_session)
File "/glue/lib/installation/awswrangler/_utils.py", line 78, in resource
retries=
{
"max_attempts": 10,
"mode": "adaptive"
}
, connect_timeout=10, max_pool_connections=30
File "/usr/local/lib/python3.6/site-packages/botocore/config.py", line 158, in __init__
self._validate_retry_configuration(self.retries)
File "/usr/local/lib/python3.6/site-packages/botocore/config.py", line 205, in _validate_retry_configuration
retry_config_option=key)
botocore.exceptions.InvalidRetryConfigurationError: Cannot provide retry configuration for "mode". Valid retry configuration options are: 'max_attempts'
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
TypeError: __init__() takes 1 positional argument but 2 were given
This seems to point back to the retry configuration set in awswrangler’s _utils.py (line 78 in the traceback above).
It seems that the botocore version that runs on Glue doesn’t support this retry configuration, while the one that I run locally does. I naively thought I was running the same dependencies locally and on Glue, but apparently I was wrong.
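For what it’s worth, here is a minimal sketch of the configuration that triggers the error (if I understand the botocore changelog correctly, the retry "mode" option only exists since botocore 1.15, so older versions reject it):

from botocore.config import Config

# Accepted on botocore >= 1.15; older versions only know "max_attempts"
# and raise InvalidRetryConfigurationError for the "mode" key.
config = Config(
    retries={"max_attempts": 10, "mode": "adaptive"},
    connect_timeout=10,
    max_pool_connections=30,
)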
Without going into too much detail: when I deploy a job to Glue, the dependencies I use in my virtualenv are packaged into a zip file and uploaded to an s3 bucket. When a deployed job runs on Glue, it’s supposed to use the dependencies coming from this zip file. It works (more or less), since I can access libraries like aws-data-wrangler and others, but it seems the botocore version that then runs on Glue comes from AWS, not from my zip file.
Apparently I’m not the only one to have this issue: https://github.com/boto/boto3/issues/2566
All this led me to believe that I can’t use wr.s3.copy_objects from aws-wrangler 1.9.3 on Glue at the moment. I haven’t tested other functions, but since it’s the _utils.resource function that is affected, I assume other functions wouldn’t work either.
Versions I run locally:
- boto-2.49.0
- botocore-1.17.44
- awswrangler-1.9.3
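A quick way to confirm which versions are actually loaded at runtime is to log them from inside the job itself (a small diagnostic sketch):

import boto3
import botocore
import awswrangler as wr

# Print the versions the Glue job actually imports; if botocore is
# older than 1.15, the "adaptive" retry mode is not available.
print("boto3:", boto3.__version__)
print("botocore:", botocore.__version__)
print("awswrangler:", wr.__version__)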
To Reproduce
I haven’t tried to isolate this issue into a minimal running example, but I imagine a simple job deployed to Glue with a call to wr.s3.copy_objects would trigger the error. aws-wrangler would need to be attached to the Glue job (and potentially all dependencies needed by aws-wrangler 😕).
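Something like this minimal script (bucket names and prefixes are hypothetical) should reproduce it once deployed to Glue:

import awswrangler as wr

# Hypothetical locations; any readable/writable bucket should do.
src = "s3://my-bucket/source/"
dst = "s3://my-bucket/target/"
files = wr.s3.list_objects(src)

# Fails on Glue when the bundled botocore predates the retry "mode" option.
wr.s3.copy_objects(paths=files, source_path=src, target_path=dst)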
Please let me know if you have an idea on how to solve that. I’d also be happy to provide more information for debugging.
Hi,
Thanks for your help. I have written some steps for people to follow to get the latest AWS CLI and boto3 APIs working; they just combine reading from multiple sources.
Note that Glue has an awscli dependency as well, along with boto3.
AWS Glue Python Shell with Internet
- Add the awscli and boto3 whl files to the Python library path during Glue job execution. This option is slow, as it has to download and install the dependencies.
- Upload the whl files to an s3 bucket under your given Python library path.
- Add the s3 whl file paths to the Python library path, giving the full s3 path of each whl file, separated by commas.
- Add the following code snippet to load the new files.
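A sketch of that snippet, based on the pattern from the AWS forums thread linked further down (GLUE_INSTALLATION is an environment variable available in Glue Python Shell jobs):

import os
import site
from importlib import reload
from setuptools.command import easy_install

# Install the newer wheels into Glue's installation directory and
# reload site so the freshly installed packages become importable.
install_path = os.environ["GLUE_INSTALLATION"]
easy_install.main(["--install-dir", install_path, "awscli"])
easy_install.main(["--install-dir", install_path, "boto3"])
reload(site)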
AWS Glue Python Shell without Internet connectivity
- Reference: AWS Wrangler Glue dependency build
- We followed the steps mentioned above for the awscli and boto3 whl files.
- Below is the latest requirements.txt compiled for the newest versions.
- Upload the boto3-depends.zip to s3 and add its path to the Glue job’s Referenced files path. Note: it is the Referenced files path, not the Python library path.
- Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python Shell, with additional code as per the thread below:
https://forums.aws.amazon.com/thread.jspa?messageID=954344
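A rough sketch of that placeholder code, assuming the Referenced files path drops boto3-depends.zip into the job’s working directory and that pip is available in the shell environment:

import os
import site
import subprocess
import sys
import zipfile
from importlib import reload

# Unpack the dependency bundle delivered via the Referenced files path
# and install offline from the bundled wheel files.
with zipfile.ZipFile("boto3-depends.zip") as zf:
    zf.extractall("boto3-depends")

install_path = os.environ["GLUE_INSTALLATION"]
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--no-index", "--find-links=boto3-depends",
    "--target", install_path,
    "awscli", "boto3",
])
reload(site)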
Thanks, Sarath
Thanks for sharing your strategy, it’s pretty cool.
The best part for me is avoiding internet access; it can also be useful in situations like this.
Yes, I think boto/botocore is probably tied to some internal dependencies, like the AWS CLI and the script that uploads your code from s3 into the job filesystem before it starts.