Handle batch queries in `DbApiHook`?

Body

TL;DR: Multiple providers need to handle the logic of splitting SQL statements (see Snowflake.run() and RedShift.run()), and it will be very similar should we eventually create PrestoOperator or TrinoOperator.

I want to discuss the idea of extracting the split-statement logic into DbApiHook, e.g. create a DbApiHook.run_many() that will split the statements and use DbApiHook.run() under the hood.

On one hand, it will reduce duplicate code and make it easier to develop operators for these providers. On the other hand, if we put it in DbApiHook, we will have to bump the minimum supported Airflow version for these providers (once they actually use the new function).
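
To make the proposal concrete, here is a minimal sketch of what a run_many() built on top of run() might look like. It is only an illustration under stated assumptions: SketchHook is a hypothetical stand-in and not Airflow's DbApiHook, it uses the standard-library sqlite3 driver (whose execute() accepts a single statement at a time) so the example is runnable, and the split on ';' is deliberately naive.

```python
import sqlite3
from typing import Any, List


class SketchHook:
    """Toy stand-in for a DB-API hook (hypothetical, not Airflow's DbApiHook)."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")

    def run(self, sql: str) -> List[Any]:
        """Execute exactly one statement and return its rows."""
        cur = self.conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()

    def run_many(self, sql: str) -> List[List[Any]]:
        """Split a batch into statements and delegate each one to run()."""
        # Naive split: this breaks on ';' inside string literals or comments.
        statements = [s.strip() for s in sql.split(";") if s.strip()]
        return [self.run(s) for s in statements]


hook = SketchHook()
results = hook.run_many("SELECT 1; SELECT 2;")
print(results)  # [[(1,)], [(2,)]]
```

With something along these lines in the base hook, a provider would only need to call run_many() instead of overriding run().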


The long story: Some DBs, like MySQL, PostgreSQL, and MSSQL, are able to run a batch of SQL statements: SELECT 1; SELECT 2;

So you can do MySQLOperator(..., sql='SELECT 1; SELECT 2;') and it will invoke: https://github.com/apache/airflow/blob/501a3c3fbefbcc0d6071a00eb101110fc4733e08/airflow/hooks/dbapi.py#L163

which will execute the SQL in a single query. (The run function will convert this SQL to ['SELECT 1; SELECT 2;']. Note this is a list with 1 cell.)
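
As a rough illustration of that conversion (a sketch only; normalize_sql is a hypothetical helper name, not a function in Airflow), wrapping a string in a list leaves the whole batch as one element:

```python
from typing import List, Union


def normalize_sql(sql: Union[str, List[str]]) -> List[str]:
    """Wrap a single string in a list; copy lists unchanged (illustrative helper)."""
    if isinstance(sql, str):
        return [sql]
    return list(sql)


batch = normalize_sql("SELECT 1; SELECT 2;")
print(batch)       # ['SELECT 1; SELECT 2;']
print(len(batch))  # 1
```

The batch is passed to the database driver as a single string, which only works when the database itself accepts multiple statements per query.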

But in other DBs, like Snowflake, Redshift, Trino, and Presto, you cannot run SELECT 1; SELECT 2; as is. You must split the statements. This makes DbApiHook.run() unusable as is for these DBs.

For example, in Snowflake, when users pass SELECT 1; SELECT 2; what we actually do is create ['SELECT 1;', 'SELECT 2;'] (note this is a list with 2 cells): https://github.com/apache/airflow/blob/501a3c3fbefbcc0d6071a00eb101110fc4733e08/airflow/providers/snowflake/hooks/snowflake.py#L314-L316
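
For readers who want to see the kind of splitting involved, here is a small sketch. It is not the Snowflake provider's implementation; split_statements is a hypothetical function written for this example, and it assumes that only single- or double-quoted text can contain a literal ';' (SQL comments are not handled).

```python
from typing import List, Optional


def split_statements(sql: str) -> List[str]:
    """Split on semicolons that are outside single- or double-quoted text."""
    statements: List[str] = []
    current: List[str] = []
    quote: Optional[str] = None  # the quote character we are inside, if any
    for ch in sql:
        current.append(ch)
        if quote:
            if ch == quote:
                quote = None  # closing quote ('' escapes simply reopen it)
        elif ch in ("'", '"'):
            quote = ch
        elif ch == ";":
            stmt = "".join(current).strip()
            if stmt != ";":  # skip empty statements such as ';;'
                statements.append(stmt)
            current = []
    tail = "".join(current).strip()
    if tail:  # last statement without a trailing semicolon
        statements.append(tail)
    return statements


print(split_statements("SELECT 1; SELECT 2;"))  # ['SELECT 1;', 'SELECT 2;']
```

A production version would more likely rely on a real SQL parser, since comments and dialect-specific quoting make hand-rolled splitting fragile.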

The problem? These providers (Snowflake, Redshift, Trino, Presto) are forced to override dbapi.run() but they all actually need the exact same functionality.

If you take a look at Snowflake.run(), you will see that it's almost identical to RedShift.run(), and it will be very similar should we eventually create PrestoOperator or TrinoOperator.

This is something already brought up earlier in https://github.com/apache/airflow/pull/15533#discussion_r620481685

So why not just set the sql param to accept a list only? Because we want to support using a .sql file, so we must handle the split of the statements.

SomeSqlOperator(
    ...,
    sql='my_queries.sql',
)

We are reading the content of the file and loading it as a single statement.


What do others think? Should we create DbApiHook.run_many() or leave it as it is today, where every provider needs to handle the logic on its own?

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
eladkal commented, Jul 10, 2022

We are waiting for https://github.com/apache/airflow/pull/23971 to be finished first to see what actions need to be taken here.

1 reaction
eladkal commented, May 5, 2022

> I wish @uranusjr wouldn’t have closed the issue I opened so quickly #23431. That I can fix quickly. This one might have some nuances that are a bit complicated for me to resolve on my own.

When we spot a problem, we prefer to fix it at the root. Supporting temporary fixes means holding off on the actual fix, and I see no reason to do so.

In your case, you are not blocked. You can create a custom hook and operator to resolve your issue. I actually agree that we should not try to solve this problem per provider.

You can start a draft PR to handle this issue. During review we can discuss the edge cases and see what we can do about them.
