Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Airflow scheduler crashes due to 'duplicate key value violates unique constraint "task_instance_pkey"'

See original GitHub issue

Apache Airflow version

2.3.3 (latest released)

What happened

Our schedulers have crashed on two occasions since upgrading to Airflow 2.3.3. The same DAG is responsible each time, but that is likely because it is the only dynamic task mapping DAG running right now (it is catching up on some historical data). The DAG uses the same imported @task function that many other DAGs have used successfully with no errors. The issue has only occurred after the upgrade to Airflow 2.3.3.

What you think should happen instead

This error should not be raised - there should be no record of this task instance because, according to the UI, the task has not run yet. The extract task is green, but the transform task which raised the error is blank. The DAG run is stuck in the running state until eventually the scheduler dies and the Airflow banner notifies me that there is no scheduler heartbeat.

Also, this same DAG (and others which use the same imported external @task function) ran for hours before the upgrade to Airflow 2.3.3.

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "task_instance_pkey"
DETAIL:  Key (dag_id, task_id, run_id, map_index)=(oportun-five9-calls-ccvcc-v1, transform, scheduled__2022-06-04T14:05:00+00:00, 0) already exists.

[SQL: UPDATE task_instance SET map_index=%(map_index)s WHERE task_instance.task_id = %(task_instance_task_id)s AND task_instance.dag_id = %(task_instance_dag_id)s AND task_instance.run_id = %(task_instance_run_id)s AND task_instance.map_index = %(task_instance_map_index)s]
[parameters: {'map_index': 0, 'task_instance_task_id': 'transform', 'task_instance_dag_id': 'dag-id', 'task_instance_run_id': 'scheduled__2022-06-04T14:05:00+00:00', 'task_instance_map_index': -1}]
(Background on this error at: https://sqlalche.me/e/14/gkpj)

[2022-07-13 22:49:25,323] {scheduler_job.py:780} INFO - Exited execute loop
[2022-07-13 22:49:55,880] {process_utils.py:75} INFO - Process psutil.Process(pid=143, status='terminated', started='22:49:52') (143) terminated with exit code None
[2022-07-13 22:49:56,001] {process_utils.py:240} INFO - Waiting up to 5 seconds for processes to exit...
[2022-07-13 22:49:56,014] {process_utils.py:75} INFO - Process psutil.Process(pid=144, status='terminated', started='22:49:52') (144) terminated with exit code None
[2022-07-13 22:49:56,016] {process_utils.py:75} INFO - Process psutil.Process(pid=140, status='terminated', exitcode=0, started='22:49:51') (140) terminated with exit code 0
[2022-07-13 22:49:56,018] {scheduler_job.py:780} INFO - Exited execute loop
[2022-07-13 22:50:23,909] {scheduler_job.py:780} INFO - Exited execute loop

How to reproduce

Run a dynamic task mapping DAG in Airflow 2.3.3
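
For reference, a minimal sketch of the kind of DAG that triggers it (the DAG and task bodies here are hypothetical stand-ins, not the reporter's actual code):

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 6, 1), schedule_interval="@daily", catchup=True)
def mapped_repro():
    @task
    def extract():
        # stand-in for the real extract task: returns a list of file paths
        return ["file_a.csv", "file_b.csv"]

    @task
    def transform(source: str):
        return f"processed {source}"

    # one mapped task instance is created per element returned by extract()
    transform.expand(source=extract())


mapped_repro_dag = mapped_repro()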

Operating System

Ubuntu 20.04

Versions of Apache Airflow Providers

The 2.3.3 constraints file for Python 3.10 (https://raw.githubusercontent.com/apache/airflow/constraints-2.3.3/constraints-3.10.txt) pins the specific versions of the following providers:

apache-airflow-providers-celery
apache-airflow-providers-docker
apache-airflow-providers-ftp
apache-airflow-providers-http
apache-airflow-providers-microsoft-azure
apache-airflow-providers-mysql
apache-airflow-providers-postgres
apache-airflow-providers-odbc
apache-airflow-providers-redis
apache-airflow-providers-salesforce
apache-airflow-providers-sftp
apache-airflow-providers-ssh
apache-airflow-providers-google

Deployment

Other Docker-based deployment

Deployment details

I am using two schedulers which run on separate nodes.

Anything else

The DAG allows at most one active DAG run at a time (max_active_runs=1). catchup=True is enabled, and the DAG has been backfilling all runs since its 05/10 start_date.
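
A sketch of those DAG-level settings as they would appear in code (the schedule is hypothetical; only max_active_runs, catchup, and the start date are taken from the description above):

from datetime import datetime

from airflow.decorators import dag


@dag(
    start_date=datetime(2022, 5, 10),  # the 05/10 start_date mentioned above
    schedule_interval="@hourly",       # hypothetical; the actual schedule is not given
    catchup=True,                      # backfill all runs since start_date
    max_active_runs=1,                 # at most one active DAG run at a time
)
def dag_name():
    ...


dag_name()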

The extract() task returns a list of one or more files which have been saved to cloud storage. The transform task processes each of these paths dynamically. I have used these same tasks (imported from another file) in over 15 different DAGs so far without issue. The problem only occurred yesterday, sometime after updating Airflow to 2.3.3.

def dag_name():
    retrieved = extract()                         # returns a list of file paths on cloud storage
    transform = transform_files(retrieved)        # mapped task: one instance per file
    finalize = finalize_dataset(transform)
    consolidate = consolidate_staging(transform)

    retrieved >> transform >> finalize >> consolidate

My transform_files task is just a function which expands the XComArg of the extract task and transforms each file. Nearly everything is based on DAG params, which are customized per DAG.


from airflow.decorators import task

transform_file_task = task(process_data)

def transform_files(source):
    return (
        transform_file_task.override(task_id="transform")
        .partial(
            destination=f"{{{{ params.container }}}}/{{{{ dag.dag_id | dag_name }}}}/{{{{ ti.task_id }}}}",
            wasb_conn_id="{{ params.wasb_conn_id }}",
            pandas_options="{{ params.pandas_options }}",
            meta_columns="{{ params.meta_columns }}",
            script="{{ params.script }}",
            function_name="{{ params.function }}",
            schema_name="{{ params.schema }}",
            template=f"{{{{ dag.dag_id | dag_version }}}}-{{{{ ti.run_id }}}}-{{{{ ti.map_index }}}}",
            existing_data_behavior="overwrite_or_ignore",
            partition_columns="{{ params.partition_columns }}",
            dag_name="{{ dag.dag_id | dag_name }}",
            failure_recipients="{{ params.recipients }}",
            success_recipients="{{ params.recipients }}",
        )
        .expand(source=source)
    )

Deleting the DAG run which caused the error and restarting the Airflow scheduler fixes the issue temporarily. If I do not delete the DAG run, the scheduler keeps dying.
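
If you need to script that temporary workaround, the stable REST API's delete-DAG-run endpoint can do it. A sketch, assuming the REST API is enabled with basic auth; the base URL and credentials are hypothetical, and dag-id matches the redacted name in the logs:

import requests
from urllib.parse import quote

BASE = "http://localhost:8080/api/v1"  # hypothetical webserver URL
dag_id = "dag-id"                      # placeholder, matching the redacted logs
run_id = "scheduled__2022-06-04T14:05:00+00:00"

# DELETE /dags/{dag_id}/dagRuns/{dag_run_id} removes the stuck run and its
# task instances; the scheduler can then be restarted cleanly.
resp = requests.delete(
    f"{BASE}/dags/{dag_id}/dagRuns/{quote(run_id, safe='')}",
    auth=("admin", "admin"),           # hypothetical basic-auth credentials
)
resp.raise_for_status()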

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 8
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
potiuk commented, Aug 10, 2022

Mostly because when we solve the crash, we might have more data/information about the underlying problem (the data are a bit clouded now, and it's even likely that a few different reasons triggered this crash).

1 reaction
potiuk commented, Aug 10, 2022

I think this issue is about the “crash”, which was the “strong” manifestation of the problem and did not let us see all the important details of the actual issue. Rather than re-opening this one, I’d open another issue solely to handle the duplicate task instance problem, if we can get hold of more information and logs.

Read more comments on GitHub >

Top Results From Across the Web

duplicate key value violates unique constraint when adding ...
I ran into the same problem when trying to do Variable.set() inside a DAG. I believe the scheduler will constantly poll the DagBag...

Scheduler — Airflow Documentation - Apache Airflow
The Airflow scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete. Behind the scenes, the scheduler...

[GitHub] [airflow] ashb commented on issue #13099
In the postgres logs we see > `ERROR: duplicate key value violates unique constraint "dag_run_dag_id_run_id_key" DETAIL: Key (dag_id, ...

Failed to create user when initializing astro (version 0.9.2)
IntegrityError) duplicate key value violates unique constraint "ab_user_username_key" DETAIL: Key (username)=(admin) already exists.

Troubleshooting DAGs | Cloud Composer
Some DAG executions issues might be caused by the Airflow scheduler not working ... The scheduler discovered a duplicate of a task and...
