Create batch version of the AWSAthenaOperator
Description
Create a batch version of the AWSAthenaOperator that can accept multiple queries and execute them.
Use case / motivation
Currently, the AWSAthenaOperator is built to handle one query and poll for its success. If you have multiple queries to execute via Athena, you must move logic into the DAG and run the AWSAthenaOperator in a for loop over the queries, as in the sketch below. This is not best practice when a task generates the batch of queries to be submitted to Athena.
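For illustration, the workaround in DAG code looks roughly like the following. This is a minimal sketch assuming the Airflow 2 Amazon provider import path (on 1.10 the operator lives in airflow.contrib.operators.aws_athena_operator); the DAG id, queries, database, and bucket names are made up:

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AWSAthenaOperator
from airflow.utils.dates import days_ago

# Hypothetical list of queries; in practice these might be generated elsewhere.
QUERIES = [
    "SELECT * FROM events LIMIT 10",
    "SELECT * FROM users LIMIT 10",
]

with DAG("athena_loop_workaround", start_date=days_ago(1), schedule_interval=None) as dag:
    for i, sql in enumerate(QUERIES):
        # One task per query: the fan-out lives in the DAG file, not in a task.
        AWSAthenaOperator(
            task_id="run_athena_query_{}".format(i),
            query=sql,
            database="my_database",
            output_location="s3://my-bucket/athena-results/",
        )

Because the loop runs at DAG parse time, the set of queries has to be known up front; a task that generates queries at run time cannot feed this pattern.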
This issue proposes that an AWSAthenaBatchOperator be created that can execute a batch of queries. This would allow Airflow users to keep the logic in the tasks instead of the DAGs.
A first take on such an operator might look like this:
from uuid import uuid4

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
# Import path for the Airflow 2 Amazon provider; on 1.10 the hook lives in
# airflow.contrib.hooks.aws_athena_hook instead.
from airflow.providers.amazon.aws.hooks.athena import AWSAthenaHook


class AWSAthenaBatchOperator(BaseOperator):
    """
    An operator that submits a batch of Presto queries to Athena against the same database.

    If ``do_xcom_push`` is True, the list of QueryExecutionIDs assigned to the
    queries will be pushed to an XCom when they successfully complete.

    :param queries: Presto queries to be run on Athena, delimited by ";\n". (templated)
    :type queries: str
    :param database: Database to select. (templated)
    :type database: str
    :param output_location: s3 path to write the query results into. (templated)
    :type output_location: str
    :param aws_conn_id: aws connection to use
    :type aws_conn_id: str
    :param sleep_time: Time to wait between two consecutive calls to check query status on Athena
    :type sleep_time: int
    :param max_tries: Number of times to poll for query state before the operator gives up
    :type max_tries: int
    """

    ui_color = '#44b5e2'
    template_fields = ('queries', 'database', 'output_location')  # was 'query'; must match the attribute name
    template_ext = ('.sql',)

    @apply_defaults
    def __init__(  # pylint: disable=too-many-arguments
        self,
        queries,
        database,
        output_location,
        aws_conn_id="aws_default",
        workgroup="primary",
        query_execution_context=None,
        result_configuration=None,
        sleep_time=30,
        max_tries=None,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.queries = queries
        self.database = database
        self.output_location = output_location
        self.aws_conn_id = aws_conn_id
        self.workgroup = workgroup
        self.query_execution_context = query_execution_context or {}
        self.result_configuration = result_configuration or {}
        self.sleep_time = sleep_time
        self.max_tries = max_tries
        self.query_execution_id = None
        self.hook = None
        self.query_execution_ids = []

    def get_hook(self):
        """Create and return an AWSAthenaHook."""
        return AWSAthenaHook(self.aws_conn_id, self.sleep_time)

    def execute(self, context):
        """Run each Presto query on Athena in turn and poll it to completion."""
        self.hook = self.get_hook()
        self.query_execution_context['Database'] = self.database
        self.result_configuration['OutputLocation'] = self.output_location
        batch = self.queries.split(";\n")
        for query in batch:
            # A fresh client request token per query so Athena does not
            # deduplicate the StartQueryExecution calls.
            self.client_request_token = str(uuid4())
            self.query_execution_id = self.hook.run_query(
                query,  # the current query from the batch (was self.query, which is never set)
                self.query_execution_context,
                self.result_configuration,
                self.client_request_token,
                self.workgroup,
            )
            query_status = self.hook.poll_query_status(self.query_execution_id, self.max_tries)
            if query_status in AWSAthenaHook.FAILURE_STATES:
                error_message = self.hook.get_state_change_reason(self.query_execution_id)
                raise Exception(
                    'Final state of Athena job is {}, query_execution_id is {}. Error: {}'
                    .format(query_status, self.query_execution_id, error_message))
            elif not query_status or query_status in AWSAthenaHook.INTERMEDIATE_STATES:
                raise Exception(
                    'Final state of Athena job is {}. '
                    'Max tries of poll status exceeded, query_execution_id is {}.'
                    .format(query_status, self.query_execution_id))
            self.query_execution_ids.append(self.query_execution_id)
        return self.query_execution_ids  # was the bare name query_execution_ids, a NameError
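For comparison, a DAG using the proposed operator would contain a single task. This is a hypothetical sketch using the class drafted above; the DAG id, queries, and S3 path are made up:

from airflow import DAG
from airflow.utils.dates import days_ago

with DAG("athena_batch_example", start_date=days_ago(1), schedule_interval=None) as dag:
    run_batch = AWSAthenaBatchOperator(  # the draft operator defined above
        task_id="run_athena_batch",
        queries=(
            "SELECT * FROM events LIMIT 10;\n"
            "SELECT * FROM users LIMIT 10"
        ),
        database="my_database",
        output_location="s3://my-bucket/athena-results/",
    )

The queries string uses the ";\n" delimiter that the draft's execute() splits on, so a semicolon that is not followed by a newline stays inside its query.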
Top GitHub Comments
Yes. That’s perfectly fine and even preferred.
Feel free @javatarz! I think it is important to get "How-to guides" with examples as part of the documentation (many of the operators have them already), but that's about it when it comes to recommendations.