xcom table deadlocks (error 1205) due to lack of index
Environment:
- Airflow Version: 2.1.4
- Cloud provider or hardware configuration: AWS hosting EC2 server/worker instances on Docker containers
- OS: Linux, current build of https://github.com/docker-library/repo-info/blob/master/repos/python/remote/3.7-slim.md#python37-slim, e.g. Linux 103b100bee6b
- Kernel: 4.14.203-116.332.amzn1.x86_64
- Install tools:
- Others: MariaDB 10.3.31 on RDS
What Happened:
When migrating from 2.1.2 to 2.1.4, the migration modified the indices on the xcom table: it added a primary key on dag_id, task_id, key, and execution_date, and removed the separate index on dag_id, task_id, and execution_date.
My team’s specific implementation runs ~1000 of the same dag_ids simultaneously with different configurations, so with only the primary key on dag_id, task_id, key, and execution_date, every xcom update query was only able to narrow as far as dag_id + task_id, leaving over 100k rows to scan for a matching execution_date. All of our tasks update xcom with status codes, and many of the tasks have similar run times across different dag runs, leading to large numbers of concurrent requests with execution_date as the only distinguishing factor, and tasks intermittently failing due to deadlocks on the xcom table.
How to reproduce it:
Uncertain, but I would suggest creating a simple DAG with 2-5 tasks. Each task should update the xcom table with its status and read the xcom table to retrieve the status of the preceding task. Running approximately 1000 instances of the DAG simultaneously several times per day should intermittently reproduce the issue once enough runs have accrued that more than 100k rows need to be scanned to find an individual execution_date or run_id value.
Alternatively, artificially fill xcom table with ~150k rows having the same dag_id and task_id, then attempt to make 1000 near-concurrent queries by dag_id, task_id, and execution_date/run_id where each request has the same dag_id and task_id but a unique run_id.
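The data-filling half of this reproduction could be sketched roughly as follows. This is a minimal illustration only: it uses an in-memory SQLite database as a stand-in for MariaDB, a simplified xcom schema, and hypothetical names (`fill_xcom`, `candidate_counts`, `my_dag`, `my_task`). It does not reproduce the deadlock itself, which would need the ~1000 near-concurrent queries against a real MariaDB/InnoDB instance; it only shows how many rows share the dag_id + task_id prefix compared with the single row that a given execution_date identifies.

```python
import sqlite3
from datetime import datetime, timedelta


def fill_xcom(conn, n_rows=150_000):
    """Create a simplified xcom table and fill it with rows that all share
    one dag_id and task_id but have unique execution_date values."""
    conn.execute(
        """
        CREATE TABLE xcom (
            dag_id TEXT NOT NULL,
            task_id TEXT NOT NULL,
            "key" TEXT NOT NULL,
            execution_date TEXT NOT NULL,
            value BLOB,
            PRIMARY KEY (dag_id, task_id, "key", execution_date)
        )
        """
    )
    start = datetime(2022, 1, 1)
    rows = (
        ("my_dag", "my_task", "return_value",
         (start + timedelta(seconds=i)).isoformat(), b"0")
        for i in range(n_rows)
    )
    conn.executemany("INSERT INTO xcom VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return start  # execution_date of the first inserted row


def candidate_counts(conn, execution_date):
    """Return (rows matching dag_id + task_id, rows also matching execution_date)."""
    prefix = conn.execute(
        "SELECT COUNT(*) FROM xcom WHERE dag_id = ? AND task_id = ?",
        ("my_dag", "my_task"),
    ).fetchone()[0]
    exact = conn.execute(
        "SELECT COUNT(*) FROM xcom "
        "WHERE dag_id = ? AND task_id = ? AND execution_date = ?",
        ("my_dag", "my_task", execution_date),
    ).fetchone()[0]
    return prefix, exact


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    first_date = fill_xcom(conn)
    print(candidate_counts(conn, first_date.isoformat()))
```

The first number is the set of rows a lookup has to consider when the index can only narrow to dag_id + task_id; the second is what the lookup actually wants.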
Anything Else We Need to Know:
We patched the issue by adding back the separate index to our xcom table on dag_id, task_id, and execution_date. My guess is that there may be a more efficient index scheme, but this has resolved the deadlocking behavior so far.
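To illustrate why adding the index back helps, here is a rough sketch that compares query plans before and after creating it. It is not the actual patch: it runs against in-memory SQLite rather than MariaDB, the schema is simplified, the index name `idx_xcom_dag_task_date` is an assumption, and the lookup is assumed to filter on dag_id, task_id, and execution_date without key. MariaDB's optimizer and InnoDB locking behave differently, so treat this only as a demonstration of how far each index can narrow the search.

```python
import sqlite3

# Simplified stand-in for the xcom table after the 2.1.4 migration.
SCHEMA = """
CREATE TABLE xcom (
    dag_id TEXT NOT NULL,
    task_id TEXT NOT NULL,
    "key" TEXT NOT NULL,
    execution_date TEXT NOT NULL,
    value BLOB,
    PRIMARY KEY (dag_id, task_id, "key", execution_date)
)
"""

# Assumed lookup shape: no filter on "key", so the primary key can only
# be used for its leading columns.
LOOKUP = (
    "SELECT value FROM xcom "
    "WHERE dag_id = ? AND task_id = ? AND execution_date = ?"
)


def query_plan(conn):
    """Return SQLite's plan description for the lookup as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + LOOKUP, ("d", "t", "2022-01-01")
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


def plans_before_and_after():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    before = query_plan(conn)
    # Re-create the separate index (hypothetical name).
    conn.execute(
        "CREATE INDEX idx_xcom_dag_task_date "
        "ON xcom (dag_id, task_id, execution_date)"
    )
    after = query_plan(conn)
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = plans_before_and_after()
    print("before:", before)
    print("after: ", after)
```

In this sketch the plan before the index can only constrain dag_id and task_id through the primary key, while the plan after it can also constrain execution_date through the new index. On MariaDB, the equivalent check would be running `EXPLAIN` on the real query.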
My understanding is that in the latest airflow version, execution_date has been replaced with run_id, but the overall scenario would be the same: where a system performs large numbers of concurrent runs of the same dags and tasks, the xcom table needs to be able to look up individual runs from an index to avoid scanning many rows and potentially deadlocking.
_Originally posted by @patrickbrady-xaxis in https://github.com/apache/airflow/issues/16982#issuecomment-1035050661_
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (5 by maintainers)

Let’s see how 2.3 works in the real world first before doing anything.
The situation may have changed since we changed XCom’s primary key. I’ll try to find some time to review this if nobody else does.