Set region flag before making getJob API call in AwsGlueJobHook
Description
As a developer I want to be able to run AWS Glue jobs in different regions from a single DAG, with different tasks.
Use case / motivation
The developer needs a workflow where a single DAG invokes an AWS Glue job in every region.
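A minimal sketch of what such a DAG could look like, assuming the provider's AwsGlueJobOperator accepts a region_name parameter (as documented in the operator docs linked further down); the job name, script location and region list here are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator

    # Hypothetical job name, script location and region list, for illustration only.
    REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-2"]

    with DAG("glue_multi_region", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        for region in REGIONS:
            # One task per region, each pointing the operator at a different region.
            AwsGlueJobOperator(
                task_id=f"run_glue_{region.replace('-', '_')}",
                job_name="example_glue_job",
                script_location="s3://example-bucket/scripts/job.py",
                region_name=region,
            )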
Are you willing to submit a PR? Maybe, if I have time.
Related Issues: N/A
Related code: Currently in this method https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_modules/airflow/providers/amazon/aws/hooks/glue.html#AwsGlueJobHook
def get_or_create_glue_job(self) -> str:
    """
    Creates(or just returns) and returns the Job name
    :return:Name of the Job
    """
    glue_client = self.get_conn()
    try:
        get_job_response = glue_client.get_job(JobName=self.job_name)
        self.log.info("Job Already exist. Returning Name of the job")
        return get_job_response['Job']['Name']
    except glue_client.exceptions.EntityNotFoundException:
        self.log.info("Job doesnt exist. Now creating and running AWS Glue Job")
        if self.s3_bucket is None:
            raise AirflowException('Could not initialize glue job, error: Specify Parameter `s3_bucket`')
        s3_log_path = f's3://{self.s3_bucket}/{self.s3_glue_logs}{self.job_name}'
        execution_role = self.get_iam_execution_role()
        try:
            create_job_response = glue_client.create_job(
                Name=self.job_name,
                Description=self.desc,
                LogUri=s3_log_path,
                Role=execution_role['Role']['RoleName'],
                ExecutionProperty={"MaxConcurrentRuns": self.concurrent_run_limit},
                Command={"Name": "glueetl", "ScriptLocation": self.script_location},
                MaxRetries=self.retry_limit,
                AllocatedCapacity=self.num_of_dpus,
                **self.create_job_kwargs,
            )
            return create_job_response['Name']
        except Exception as general_error:
            self.log.error("Failed to create aws glue job, error: %s", general_error)
            raise
The get_job method does not accept a region flag here, so the region flag passed into the AwsGlueJobOperator doesn't apply. As a result the method fails if we expect the job to be in a region other than the default one in the credential chain.
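For context, boto3 Glue clients are bound to a region when they are created; get_job itself takes no region argument, so the call goes to whichever region the client was built with. A minimal illustration (the region names are arbitrary examples):

    import boto3

    # The region is fixed when the client is constructed, not per API call.
    glue_us = boto3.client("glue", region_name="us-east-1")
    glue_eu = boto3.client("glue", region_name="eu-west-1")

    # Both calls have the same signature; only the client's region differs.
    # glue_us.get_job(JobName="example_glue_job")  # looks the job up in us-east-1
    # glue_eu.get_job(JobName="example_glue_job")  # looks the job up in eu-west-1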
Workaround: we can set the AWS_DEFAULT_REGION environment variable in Python using the line below before the task runs. However, this might cause issues if there are concurrent DAGs running on the same machine, since race conditions could occur.
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
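One way this workaround could be scoped a little more tightly is to set the variable immediately before the Glue call, for example in a small operator subclass. This is only a sketch; the subclass and its pinned_region parameter are hypothetical, and it does not remove the race condition described above, since environment variables remain process-wide:

    import os

    from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator

    class RegionPinnedGlueJobOperator(AwsGlueJobOperator):
        """Hypothetical wrapper: pins AWS_DEFAULT_REGION right before execution.

        Still racy if other tasks in the same worker process expect a different
        default region at the same time.
        """

        def __init__(self, *, pinned_region: str, **kwargs):
            super().__init__(**kwargs)
            self.pinned_region = pinned_region

        def execute(self, context):
            # Apply the workaround from the issue just before the Glue call runs.
            os.environ["AWS_DEFAULT_REGION"] = self.pinned_region
            return super().execute(context)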
Top GitHub Comments
Looking at the code, the region_name from the AwsGlueOperator is handled in the constructor – none of the methods in the AWS hooks should take a region_name parameter. And the operator takes a region_name (https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/operators/glue/index.html?highlight=glue#airflow.providers.amazon.aws.operators.glue.AwsGlueJobOperator) and passes it on, so from looking at the code, it should work through the self.region_name property of the hook.
This issue has been closed because it has not received a response from the issue author.
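A minimal sketch of the behaviour that comment describes, assuming the installed provider forwards region_name from the hook's constructor through to the underlying AwsBaseHook (the job name and region here are hypothetical):

    from airflow.providers.amazon.aws.hooks.glue import AwsGlueJobHook

    # If region_name reaches the base hook's constructor, the boto3 Glue client
    # returned by get_conn() is already scoped to that region, so the later
    # get_job(JobName=...) call looks the job up in eu-west-1 rather than the
    # default region from the credential chain.
    hook = AwsGlueJobHook(job_name="example_glue_job", region_name="eu-west-1")
    glue_client = hook.get_conn()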