xcom table deadlocks (error 1205) due to lack of index

Environment:

  • Airflow Version: 2.1.4
  • Cloud provider or hardware configuration: AWS hosting EC2 server/worker instances on Docker containers
  • OS: Linux, current build of https://github.com/docker-library/repo-info/blob/master/repos/python/remote/3.7-slim.md#python37-slim, e.g. Linux 103b100bee6b
  • Kernel: 4.14.203-116.332.amzn1.x86_64
  • Install tools:
  • Others: MariaDB 10.3.31 on RDS

What Happened: When migrating from 2.1.2 to 2.1.4, the migration modified the indices on the xcom table: it added a primary key on dag_id, task_id, key, and execution_date, and removed the separate index on dag_id, task_id, and execution_date.

My team’s specific implementation runs ~1000 of the same dag_ids simultaneously with different configurations, so with only the primary key on dag_id, task_id, key, and execution_date, every xcom update query was only able to narrow as far as dag_id + task_id, leaving over 100k rows to scan for a matching execution_date. All of our tasks update xcom with status codes, and many of the tasks have similar run times across different dag runs, leading to large numbers of concurrent requests with execution_date as the only distinguishing factor, and tasks intermittently failing due to deadlocks on the xcom table.

How to reproduce it: Uncertain, but I would suggest creating a simple DAG with 2-5 tasks. Each task should update the xcom table with its status and read the xcom table to retrieve the status of the preceding task. Running approximately 1000 instances of the DAG simultaneously, several times per day, should intermittently reproduce the issue once enough runs have accrued that more than 100k rows need to be scanned to find an individual execution_date or run_id value.

Alternatively, artificially fill the xcom table with ~150k rows having the same dag_id and task_id, then attempt to make 1000 near-concurrent queries by dag_id, task_id, and execution_date/run_id, where each request has the same dag_id and task_id but a unique execution_date/run_id.
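The synthetic data set described above could be generated along these lines. This is only a sketch of the data shape: the `dag_id`, `task_id`, and `key` values, the function names, and the one-second spacing of `execution_date` are all made up for illustration, and the code does not talk to a database or reproduce the deadlock by itself.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, List, Tuple

# (dag_id, task_id, key, execution_date, value) -- a simplified xcom row
Row = Tuple[str, str, str, str, bytes]


def synthetic_xcom_rows(
    n_rows: int,
    dag_id: str = "example_dag",      # hypothetical name
    task_id: str = "report_status",   # hypothetical name
    key: str = "status",              # hypothetical name
) -> Iterator[Row]:
    """Yield rows that share dag_id/task_id and differ only in execution_date."""
    start = datetime(2022, 1, 1, tzinfo=timezone.utc)
    for i in range(n_rows):
        execution_date = (start + timedelta(seconds=i)).isoformat()
        yield (dag_id, task_id, key, execution_date, b"0")


def lookup_params(n_queries: int) -> List[Tuple[str, str, str]]:
    """Build (dag_id, task_id, execution_date) parameters for n_queries lookups.

    Every lookup targets the same dag_id and task_id but a distinct
    execution_date, which is the access pattern described in the issue.
    """
    return [(d, t, ts) for d, t, _key, ts, _value in synthetic_xcom_rows(n_queries)]


rows = list(synthetic_xcom_rows(150_000))
params = lookup_params(1_000)
```

The rows would then be bulk-inserted into a test xcom table, and the lookups issued from many concurrent connections against a MariaDB instance to approximate the load in the report.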

Anything Else We Need to Know: We patched the issue by adding back the separate index to our xcom table on dag_id, task_id, and execution_date. My guess is that there may be a more efficient index scheme, but this has resolved the deadlocking behavior so far.
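To illustrate why the extra index helps, here is a small sketch that uses SQLite from the Python standard library as a stand-in. It is not MariaDB/InnoDB, its query planner and locking behave differently, and the table is a simplified version of xcom, so treat it only as a picture of the index layout. The index name `idx_xcom_dag_task_date` is an assumption; use whatever name fits your schema.

```python
import sqlite3

# Simplified xcom table with the composite primary key described in the issue.
SCHEMA = """
CREATE TABLE xcom (
    dag_id         TEXT NOT NULL,
    task_id        TEXT NOT NULL,
    "key"          TEXT NOT NULL,
    execution_date TEXT NOT NULL,
    value          BLOB,
    PRIMARY KEY (dag_id, task_id, "key", execution_date)
)
"""

# Lookup that filters on dag_id, task_id, and execution_date but not on key.
LOOKUP = (
    "SELECT value FROM xcom "
    "WHERE dag_id = ? AND task_id = ? AND execution_date = ?"
)


def query_plan(conn: sqlite3.Connection) -> str:
    """Return SQLite's plan for LOOKUP as a single string."""
    plan_rows = conn.execute(
        "EXPLAIN QUERY PLAN " + LOOKUP, ("d", "t", "2022-01-01")
    ).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in plan_rows)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

plan_before = query_plan(conn)

# The patch from the issue: a separate index on (dag_id, task_id, execution_date).
conn.execute(
    "CREATE INDEX idx_xcom_dag_task_date "
    "ON xcom (dag_id, task_id, execution_date)"
)

plan_after = query_plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Because `key` sits between `task_id` and `execution_date` in the primary key, a lookup that omits `key` can only use the leading `dag_id, task_id` prefix of that key; the separate index lets all three filter columns be matched. On MariaDB, the equivalent check would be to run `EXPLAIN` on the real query before and after adding the index.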

My understanding is that in the latest airflow version, execution_date has been replaced with run_id, but the overall scenario would be the same: where a system performs large numbers of concurrent runs of the same dags and tasks, the xcom table needs to be able to look up individual runs from an index to avoid scanning many rows and potentially deadlocking.

_Originally posted by @patrickbrady-xaxis in https://github.com/apache/airflow/issues/16982#issuecomment-1035050661_

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
uranusjr commented, Apr 11, 2022

Let’s see how 2.3 works in the real world first before doing anything.

1 reaction
uranusjr commented, Feb 18, 2022

The situation may have changed since we changed XCom’s primary key. I’ll try to find some time to review this if nobody else does.
