Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Create batch version of the AWSAthenaOperator

See original GitHub issue

Description

Create a batch version of the AWSAthenaOperator that can accept multiple queries and execute them.

Use case / motivation

Currently, the AWSAthenaOperator handles a single query and polls for its success. If you have multiple queries to execute via Athena, you must move logic into the DAG and run the AWSAthenaOperator in a for loop over the queries. This is not best practice when a task generates a batch of queries to be submitted to Athena.
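The DAG-level workaround described above amounts to generating one task per query in a loop inside the DAG file. A minimal sketch of the pattern, using a plain dict as a stand-in for constructing an AWSAthenaOperator (the task-id naming is hypothetical):

```python
queries = ["SELECT 1", "SELECT 2", "SELECT 3"]

# In a real DAG file each iteration would instantiate
# AWSAthenaOperator(task_id=..., query=..., ...) instead of a dict;
# the point is that the batching logic lives in the DAG, not in a task.
tasks = [
    {"task_id": "run_athena_query_{}".format(i), "query": q}
    for i, q in enumerate(queries)
]
```

This works, but it fixes the number of tasks at DAG-parse time, which is exactly why the issue asks for a single operator that accepts the whole batch.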

This issue proposes an AWSAthenaBatchOperator that can execute a batch of queries. This would allow Airflow users to keep that logic in tasks instead of in their DAGs.

A first take on such an operator:

class AWSAthenaBatchOperator(BaseOperator):
    """
    An operator that submit a batch of presto queries to athena for the same database.
    If ``do_xcom_push`` is True, the QueryExecutionID assigned to the
    query will be pushed to an XCom when it successfuly completes.
    :param query: Presto to be run on athena. (templated)
    :type queries: str demlinited by ";\n"
    :param database: Database to select. (templated)
    :type database: str
    :param output_location: s3 path to write the query results into. (templated)
    :type output_location: str
    :param aws_conn_id: aws connection to use
    :type aws_conn_id: str
    :param sleep_time: Time to wait between two consecutive call to check query status on athena
    :type sleep_time: int
    :param max_tries: Number of times to poll for query state before function exits
    :type max_triex: int
    """

    ui_color = '#44b5e2'
    template_fields = ('queries', 'database', 'output_location')
    template_ext = ('.sql', )

    @apply_defaults
    def __init__(  # pylint: disable=too-many-arguments
        self,
        queries,
        database,
        output_location,
        aws_conn_id="aws_default",
        workgroup="primary",
        query_execution_context=None,
        result_configuration=None,
        sleep_time=30,
        max_tries=None,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.queries = queries
        self.database = database
        self.output_location = output_location
        self.aws_conn_id = aws_conn_id
        self.workgroup = workgroup
        self.query_execution_context = query_execution_context or {}
        self.result_configuration = result_configuration or {}
        self.sleep_time = sleep_time
        self.max_tries = max_tries
        self.query_execution_id = None
        self.hook = None
        self.query_execution_ids = []

    def get_hook(self):
        """Create and return an AWSAthenaHook."""
        return AWSAthenaHook(self.aws_conn_id, self.sleep_time)

    def execute(self, context):
        """Run each Presto query in the batch on Athena, in order."""
        self.hook = self.get_hook()

        self.query_execution_context['Database'] = self.database
        self.result_configuration['OutputLocation'] = self.output_location

        # Split on ";\n" and drop empty fragments (e.g. from a trailing delimiter).
        batch = [q.strip() for q in self.queries.split(";\n") if q.strip()]

        for query in batch:
            client_request_token = str(uuid4())  # new token per query for idempotency
            self.query_execution_id = self.hook.run_query(
                query, self.query_execution_context, self.result_configuration,
                client_request_token, self.workgroup)
            query_status = self.hook.poll_query_status(self.query_execution_id, self.max_tries)

            if query_status in AWSAthenaHook.FAILURE_STATES:
                error_message = self.hook.get_state_change_reason(self.query_execution_id)
                raise Exception(
                    'Final state of Athena job is {}, query_execution_id is {}. Error: {}'
                    .format(query_status, self.query_execution_id, error_message))
            elif not query_status or query_status in AWSAthenaHook.INTERMEDIATE_STATES:
                raise Exception(
                    'Final state of Athena job is {}. '
                    'Max tries of poll status exceeded, query_execution_id is {}.'
                    .format(query_status, self.query_execution_id))
            self.query_execution_ids.append(self.query_execution_id)

        return self.query_execution_ids
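The ``";\n"`` delimiter convention the operator relies on is worth illustrating on its own. A standalone sketch of the splitting step (plain Python, no Airflow needed; the trailing-delimiter handling is an assumption about how rendered ``.sql`` templates often end):

```python
# Split a ";\n"-delimited query string into individual statements,
# dropping the empty fragment left behind by a trailing delimiter.
queries = (
    "CREATE TABLE IF NOT EXISTS demo (id int);\n"
    "SELECT * FROM demo;\n"
)
batch = [q.strip() for q in queries.split(";\n") if q.strip()]
# batch -> two statements, no empty trailing entry
```

Note that a bare ``";"`` inside a string literal would not be split, but a ``";\n"`` inside one would be, so this convention assumes queries do not contain the delimiter in literals.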

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
potiuk commented, Oct 9, 2021

@potiuk: If I can make sure the operator is backward compatible and clearly documented, would it be OK to update the existing operator?

Yes. That’s perfectly fine and even preferred.
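One backward-compatible approach (a sketch only, not the actual provider implementation; the helper name is hypothetical) is to let the existing ``query`` parameter accept either a single string, as today, or a list of strings:

```python
def normalize_queries(query):
    """Accept a single query string (current behaviour) or a list of
    queries (new batch behaviour) and always return a list.

    Hypothetical helper illustrating how the existing ``query``
    parameter could gain batch support without breaking callers.
    """
    if isinstance(query, str):
        return [query]
    return list(query)
```

Existing DAGs that pass a string keep working unchanged, while new DAGs can pass a list and have the operator loop internally.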

1 reaction
potiuk commented, Oct 9, 2021

Feel free, @javatarz! I think it is important to get “How-to guides” with examples as part of the documentation (many of the operators have them already), but that’s about it when it comes to recommendations.


