Set region flag before making getJob API call in AwsGlueJobHook

See original GitHub issue

Description

As a developer, I want to be able to run AWS Glue jobs in different regions from a single DAG, with a different task per region.

Use case / motivation

The developer needs a workflow where a single DAG invokes an AWS Glue job in each region.

Are you willing to submit a PR? Maybe, if I have time.

Related Issues: N/A

Related code: currently, the issue is in this method (https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_modules/airflow/providers/amazon/aws/hooks/glue.html#AwsGlueJobHook):

def get_or_create_glue_job(self) -> str:
    """
    Creates(or just returns) and returns the Job name
    :return:Name of the Job
    """
    glue_client = self.get_conn()
    try:
        get_job_response = glue_client.get_job(JobName=self.job_name)
        self.log.info("Job Already exist. Returning Name of the job")
        return get_job_response['Job']['Name']

    except glue_client.exceptions.EntityNotFoundException:
        self.log.info("Job doesnt exist. Now creating and running AWS Glue Job")
        if self.s3_bucket is None:
            raise AirflowException('Could not initialize glue job, error: Specify Parameter `s3_bucket`')
        s3_log_path = f's3://{self.s3_bucket}/{self.s3_glue_logs}{self.job_name}'
        execution_role = self.get_iam_execution_role()
        try:
            create_job_response = glue_client.create_job(
                Name=self.job_name,
                Description=self.desc,
                LogUri=s3_log_path,
                Role=execution_role['Role']['RoleName'],
                ExecutionProperty={"MaxConcurrentRuns": self.concurrent_run_limit},
                Command={"Name": "glueetl", "ScriptLocation": self.script_location},
                MaxRetries=self.retry_limit,
                AllocatedCapacity=self.num_of_dpus,
                **self.create_job_kwargs,
            )
            return create_job_response['Name']
        except Exception as general_error:
            self.log.error("Failed to create aws glue job, error: %s", general_error)
            raise

The get_job method does not accept a region parameter here, so the region flag passed into the AwsGlueJobOperator does not apply. As a result, the method fails whenever the job lives in a region other than the default one resolved from the credential chain.
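
To see why, note that a boto3 client binds its region at construction time, not per API call; get_job itself has no region parameter. A minimal sketch outside Airflow (the job name is a placeholder, and working AWS credentials are assumed):

import boto3

# The region is fixed when the client is built. A client scoped to
# us-east-1 can only see Glue jobs in us-east-1; there is no way to
# override the region on the get_job call itself.
glue_east = boto3.client("glue", region_name="us-east-1")
glue_west = boto3.client("glue", region_name="eu-west-1")

# The same call against differently-scoped clients hits different regions.
east_job = glue_east.get_job(JobName="my_glue_job")
west_job = glue_west.get_job(JobName="my_glue_job")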

Workaround: we can set the AWS_DEFAULT_REGION environment variable in Python with the line below before the task runs. However, this is fragile if concurrent DAGs run on the same machine, since tasks may race on the shared environment variable.

os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
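
To make that race concrete, here is a hedged sketch of the workaround wired into a DAG (the DAG id, job name, and regions are illustrative, and boto3 is called directly rather than through the hook for brevity); os.environ is process-wide, which is exactly where concurrent tasks can collide:

import os
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_glue_job(region: str, job_name: str) -> None:
    # The workaround: mutate the process-wide default region before the
    # boto3 client is created. os.environ is shared by every task running
    # in this worker process, hence the race condition noted above.
    os.environ["AWS_DEFAULT_REGION"] = region
    glue = boto3.client("glue")  # region resolved from AWS_DEFAULT_REGION
    glue.start_job_run(JobName=job_name)


with DAG("multi_region_glue_workaround", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    # One task per region, all inside a single DAG.
    for region in ("us-east-1", "eu-west-1"):
        PythonOperator(
            task_id=f"glue_{region.replace('-', '_')}",
            python_callable=run_glue_job,
            op_kwargs={"region": region, "job_name": "my_glue_job"},
        )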

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
ashb commented, Apr 15, 2021

Looking at the code, the region_name from the AwsGlueOperator is handled in the constructor – none of the methods in the Aws hooks should take a region_name parameter.

And the operator takes a region_name (https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/glue/index.html?highlight=glue#airflow.providers.amazon.aws.operators.glue.AwsGlueJobOperator) and passes it on, so from looking at the code, it should work like this:

  • If an explicit region_name is passed via the operator, that is used.
  • If not, the default from whatever connection is used is taken.
  • All access via the AwsGlueHook should already respect the self.region_name property of the hook (see the sketch below).
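
Assuming the behaviour described above holds in the installed provider version, the multi-region use case from the description could be expressed per task; the DAG id, job name, script location, and role name below are all illustrative:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator

with DAG("multi_region_glue", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    # One Glue task per region in a single DAG. region_name is forwarded
    # to the hook's constructor, which builds a boto3 client scoped to
    # that region, so no environment-variable workaround is needed.
    for region in ("us-east-1", "eu-west-1"):
        AwsGlueJobOperator(
            task_id=f"glue_job_{region.replace('-', '_')}",
            job_name="my_glue_job",                           # placeholder
            script_location="s3://my-bucket/scripts/etl.py",  # placeholder
            iam_role_name="my-glue-role",                     # placeholder
            region_name=region,
        )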
0 reactions
github-actions[bot] commented, May 23, 2021

This issue has been closed because it has not received a response from the issue author.
