Scheduler crashes with psycopg2.errors.DeadlockDetected exception

Apache Airflow version

2.2.5 (latest released)

What happened

A customer has a DAG that dynamically generates around 2500 tasks using a task group. When the DAG runs, roughly 1000 of the tasks succeed without issue, the remaining ~1500 are marked “skipped”, and the DAG run fails. The same DAG runs successfully on Airflow v2.1.3 with the same Airflow configuration.

While investigating the Airflow processes, we found that both schedulers were restarted with the error below during DAG execution.

[2022-04-27 20:42:44,347] {scheduler_job.py:742} ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1256, in _execute_context
    self.dialect.do_executemany(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 912, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 1646244 waits for ShareLock on transaction 3915993452; blocked by process 1640692.
Process 1640692 waits for ShareLock on transaction 3915992745; blocked by process 1646244.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (189873,4) in relation "task_instance"
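
For reference, the backends named in the DETAIL lines can be matched to live lock waits on the Postgres side. The snippet below is a generic diagnostic sketch, not Airflow code: it relies only on the standard pg_stat_activity view and the pg_blocking_pids() function (PostgreSQL 9.6+), and the DSN is a placeholder for the Airflow metadata database. Deadlocks themselves are resolved by Postgres within about a second (deadlock_timeout), so this is most useful for watching the lock waits that build up before one is detected.

import psycopg2

# Placeholder DSN - point it at the Airflow metadata database.
DSN = "dbname=airflow user=airflow host=localhost"

BLOCKING_QUERY = """
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(BLOCKING_QUERY)
    for blocked_pid, blocked_query, blocking_pid, blocking_query in cur.fetchall():
        # Print each "who blocks whom" pair currently visible to Postgres.
        print(f"{blocking_pid} blocks {blocked_pid}: {blocking_query!r} -> {blocked_query!r}")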

This issue seems to be related to #19957

What you think should happen instead

This issue was observed while running a huge number of concurrent tasks created dynamically by a DAG. Some of the tasks are skipped because the scheduler restarts with the deadlock exception; the scheduler should not crash, and all generated tasks should be scheduled and run.

How to reproduce

DAG file:

from propmix_listings_details import (
    BUCKET,
    ZIPS_FOLDER,
    CITIES_ZIP_COL_NAME,
    DETAILS_DEV_LIMIT,
    DETAILS_RETRY,
    DETAILS_CONCURRENCY,
    get_api_token,
    get_values,
    process_listing_ids_based_zip,
)
from airflow.utils.task_group import TaskGroup
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
}

date = '{{ execution_date }}'
email_to = ['example@airflow.com']
# Using a DAG context manager, you don't have to specify the dag property of each task

state = 'Maha'
with DAG('listings_details_generator_{0}'.format(state),
        start_date=datetime(2021, 11, 18),
        schedule_interval=None,
        max_active_runs=1,
        concurrency=DETAILS_CONCURRENCY,
        dagrun_timeout=timedelta(minutes=10),
        catchup=False  # enable if you don't want historical dag runs to run
        ) as dag:
    t0 = DummyOperator(task_id='start')

    with TaskGroup(group_id='group_1') as tg1:
        token = get_api_token()
        zip_list = get_values(BUCKET, ZIPS_FOLDER+state, CITIES_ZIP_COL_NAME)
        for zip in zip_list[0:DETAILS_DEV_LIMIT]:
            # One PythonOperator is generated per ZIP code - this is where the
            # ~2500 dynamically created tasks come from.
            details_operator = PythonOperator(
                task_id='details_{0}_{1}'.format(state, zip),  # task id is generated dynamically
                pool='pm_details_pool',
                python_callable=process_listing_ids_based_zip,
                task_concurrency=40,
                retries=3,
                retry_delay=timedelta(seconds=10),
                op_kwargs={'zip': zip, 'date': date, 'token':token, 'state':state}
            )
            
    t0 >> tg1

Operating System

Kubernetes cluster running on GCP, Linux (amd64)

Versions of Apache Airflow Providers

pip freeze | grep apache-airflow-providers

apache-airflow-providers-amazon==1!3.2.0
apache-airflow-providers-cncf-kubernetes==1!3.0.0
apache-airflow-providers-elasticsearch==1!2.2.0
apache-airflow-providers-ftp==1!2.1.2
apache-airflow-providers-google==1!6.7.0
apache-airflow-providers-http==1!2.1.2
apache-airflow-providers-imap==1!2.2.3
apache-airflow-providers-microsoft-azure==1!3.7.2
apache-airflow-providers-mysql==1!2.2.3
apache-airflow-providers-postgres==1!4.1.0
apache-airflow-providers-redis==1!2.0.4
apache-airflow-providers-slack==1!4.2.3
apache-airflow-providers-snowflake==2.6.0
apache-airflow-providers-sqlite==1!2.1.3
apache-airflow-providers-ssh==1!2.4.3

Deployment

Astronomer

Deployment details

  • Airflow v2.2.5-2
  • Scheduler count: 2
  • Scheduler resources: 20AU (2 CPU and 7.5 GB)
  • Executor: Celery
  • Worker count: 2
  • Worker resources: 24AU (2.4 CPU and 9 GB)
  • Termination grace period: 2 minutes

Anything else

This issue happens in every DAG run: some of the tasks are skipped, some succeed, and the scheduler fails with the deadlock exception.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 47 (33 by maintainers)

Top GitHub Comments

potiuk commented on Jul 24, 2022 (2 reactions)

Thanks for the REALLY detailed investigation @dstaple.

I finally had some time to take a look at this and I think your assessment was very correct.

However, the solution you proposed is not good, because I think we DO want to run “SELECT FOR UPDATE” on the DagRun table. The whole scheduling is based on the fact that the DagRun row gets locked and no changes happen to the DagRun or to any TaskInstances of that DagRun while the scheduler processes those task instances. And since local_task_run potentially changes the state of the task instance it runs (that’s why it locks it for update), if the whole DagRun is currently “being processed” by any of the schedulers, we should hold off running the task until the scheduler finishes processing that particular DagRun and releases the lock.
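
To make that locking pattern concrete, here is a simplified, illustrative sketch (not the actual scheduler code; schedule_one_dag_run is a made-up name) of the scheduler-side order described above, using the public DagRun and TaskInstance models with plain SQLAlchemy with_for_update():

from airflow.models import DagRun, TaskInstance

def schedule_one_dag_run(session, dag_id, run_id):
    # Scheduler-side order: lock the DagRun row first ...
    dag_run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == dag_id, DagRun.run_id == run_id)
        .with_for_update()
        .one()
    )
    # ... then examine and update the task instances of that run while the
    # DagRun lock is held; the commit releases everything at once.
    tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.dag_id == dag_id, TaskInstance.run_id == run_id)
        .all()
    )
    for ti in tis:
        ...  # scheduling decisions happen here in the real scheduler
    session.commit()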

And in this case “local_task_run” actually locks the DagRun table too (though I am not entirely sure why; this is the one thing I do not understand completely - see below). So it does what it should, but with one very little caveat - it locks the TaskInstance and DagRun in REVERSE ORDER compared to what the scheduler does. This is actually the root cause of ALL deadlocks (at least in Postgres; MySQL has its own fair share of other kinds of deadlocks) - inconsistent ordering. A deadlock appears when two threads want two (or more) resources and grab locks on them in reverse order. This is actually the only reason for this kind of deadlock, and your investigation showed very nicely what’s going on.
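
To illustrate the reverse-order point in isolation, here is a small, self-contained sketch (nothing Airflow-specific; the demo table and DSN are made up for the example) in which two psycopg2 connections take row locks on the same two rows in opposite order. Postgres kills one of them with the same DeadlockDetected error as in the traceback above.

import threading
import psycopg2

DSN = "dbname=test user=postgres"  # placeholder - any scratch Postgres database

def setup():
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("DROP TABLE IF EXISTS demo")
        cur.execute("CREATE TABLE demo (id int PRIMARY KEY, val int)")
        cur.execute("INSERT INTO demo VALUES (1, 0), (2, 0)")

def worker(first_id, second_id, barrier):
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:
            # Take the first row lock, then wait until the other worker holds its own.
            cur.execute("SELECT val FROM demo WHERE id = %s FOR UPDATE", (first_id,))
            barrier.wait()
            # Each transaction now wants the row the other one already holds.
            cur.execute("SELECT val FROM demo WHERE id = %s FOR UPDATE", (second_id,))
    except psycopg2.errors.DeadlockDetected as exc:
        print("deadlock detected, as expected:", exc)
    finally:
        conn.close()

if __name__ == "__main__":
    setup()
    barrier = threading.Barrier(2)
    workers = [
        threading.Thread(target=worker, args=(1, 2, barrier)),  # locks row 1, then 2
        threading.Thread(target=worker, args=(2, 1, barrier)),  # locks row 2, then 1
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()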

The solution to that is simple - since we are going to get the DagRun lock in a moment anyway in “refresh_from_db”, we should simply get the lock on the DagRun table FIRST. This should fix the problem, as we will then grab locks in the same sequence in the scheduler and in task_run -> first DagRun, then TaskInstance. This is what my proposed #25266 does.
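
A simplified sketch of that ordering idea (not the actual #25266 diff; refresh_ti_locked is a made-up name): the task-side refresh grabs the DagRun lock before the TaskInstance lock, mirroring the scheduler-side order shown earlier.

from airflow.models import DagRun, TaskInstance

def refresh_ti_locked(session, ti):
    # 1. Lock the parent DagRun row first, in the same order the scheduler uses.
    session.query(DagRun).filter(
        DagRun.dag_id == ti.dag_id,
        DagRun.run_id == ti.run_id,
    ).with_for_update().one()
    # 2. Only then lock the TaskInstance row itself.
    return session.query(TaskInstance).filter(
        TaskInstance.dag_id == ti.dag_id,
        TaskInstance.task_id == ti.task_id,
        TaskInstance.run_id == ti.run_id,
        TaskInstance.map_index == ti.map_index,
    ).with_for_update().first()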

The only thing I do not know is WHY TaskInstance.refresh_from_db actually does the JOIN query:

SELECT FROM task_instance JOIN dag_run AS dag_run_1 ON dag_run_1.dag_id = task_instance.dag_id AND dag_run_1.run_id ... FOR UPDATE

The original query in the code looks like this:

        qry = session.query(TaskInstance).filter(
            TaskInstance.dag_id == self.dag_id,
            TaskInstance.task_id == self.task_id,
            TaskInstance.run_id == self.run_id,
            TaskInstance.map_index == self.map_index,
        )

        if lock_for_update:
            for attempt in run_with_db_retries(logger=self.log):
                with attempt:
                    ti: Optional[TaskInstance] = qry.with_for_update().first()

And there is no obvious reason why the last line joins the dag_run table?

I hope someone else in this thread might shed some light on it. I have a suspicion that SQLAlchemy adds the join when there is a ForeignKey with ON DELETE CASCADE on dag_id (which we have), but I could not find any reference or documentation that points to such behaviour.

@RNHTTR - since you mentioned you can reproduce the issue, maybe you could apply my fix and see if it solves the problem (there is a bit of a leap of faith with this change).

potiuk commented on Jun 29, 2022 (2 reactions)

@ldacey @whitleykeith @argibbs @eitanme -> I spoke with some enlightened people 😃 (yeah, talking about you @ashb and @bbovenzi), and after the talk I have a hypothesis that this is the new Grid view doing auto-refresh for a long-running DAG.

There was a fix by @ashb, https://github.com/apache/airflow/pull/24284, that is going to be released in 2.3.3 and that significantly decreases the number of queries generated by the Grid view refresh. It’s a huge improvement and might impact both the load on the DB and possibly the memory usage of the webserver, especially if there are almost continuously running DAGs and a number of people leave the browser open with “auto-refresh” on in the Grid view.

Is there a way some of you could test the hypothesis and see if there might be a correlation? (It requires a bit of coordination on what your users do.)

(BTW. If that’s it then Ash’s fix is coming in 2.3.3).

Read more comments on GitHub.
