
MySQL deadlock when using DAG serialization

See original GitHub issue

  • Apache Airflow version: 1.10.10
  • Kubernetes version: v1.16.8
  • MySQL version: 5.7

What happened: Airflow tasks fail with a deadlock when running a DAG with max_active_runs > 1 and concurrency > 1, and DAG serialization is enabled.
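For reference, a minimal sketch of a DAG shaped the way the report describes (the DAG id, schedule, task and exact values are made up for illustration; DAG serialization itself is enabled separately in airflow.cfg, e.g. via the [core] store_serialized_dags setting in 1.10.x):

# Illustrative only: a DAG with max_active_runs > 1 and concurrency > 1,
# the combination under which the deadlock is reported.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="some_dag_v.0.0.1",       # placeholder id, mirroring the log below
    start_date=datetime(2019, 12, 1),
    schedule_interval="@daily",
    max_active_runs=4,               # several DagRuns in flight at once
    concurrency=8,                   # several task instances in parallel
    catchup=True,
) as dag:
    some_task = DummyOperator(task_id="some_task_id")  # placeholder task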

Logs

[2020-04-22 19:19:49,018] {taskinstance.py:1145} ERROR - (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: INSERT INTO rendered_task_instance_fields (dag_id, task_id, execution_date, rendered_fields) VALUES (%s, %s, %s, %s)] [parameters: ('some_dag_v.0.0.1', 'some_task_id', datetime.datetime(2019, 12, 2, 0, 0), 'Some rendered fields (837 characters truncated)')]

(Background on this error at: http://sqlalche.me/e/e3q8)

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 255, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
    raise errorvalue
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 252, in execute
    res = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 378, in _query
    db.query(q)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/connections.py", line 280, in query
    _mysql.connection.query(self, query)
_mysql_exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1002, in _run_raw_task
    self.refresh_from_db(lock_for_update=True)
  File "/usr/local/lib/python3.7/site-packages/airflow/utils/db.py", line 74, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.7/site-packages/airflow/utils/db.py", line 45, in create_session
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1036, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 503, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 482, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
    exc_value, with_traceback=exc_tb,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1083, in _emit_insert_statements
    c = cached_connections[connection].execute(statement, multiparams)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1103, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1482, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 255, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
    raise errorvalue
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 252, in execute
    res = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 378, in _query
    db.query(q)
  File "/usr/local/lib/python3.7/site-packages/MySQLdb/connections.py", line 280, in query
    _mysql.connection.query(self, query)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

2 reactions
ozw1z5rd commented, Jul 30, 2020

Hi, I have the same issue.

I was looking at the models/renderedtifields.py file and noticed that

def delete_old_records(

contains a line that loads the number of rendered fields to keep:

num_to_keep=conf.getint("core", "max_num_rendered_ti_fields_per_task", fallback=0)

and if this value is <= 0, the function returns without doing anything:

 if num_to_keep <= 0:
     return
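
To double-check which value that call actually resolves to in a given deployment, the same lookup can be run by hand (a quick sketch, assuming it is executed somewhere Airflow's configuration module is importable):

# Mirrors the conf.getint() call quoted above; prints the value
# delete_old_records will see in this environment.
from airflow.configuration import conf

print(conf.getint("core", "max_num_rendered_ti_fields_per_task", fallback=0))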

Since the deadlock involves the insert and the delete on that table, setting max_num_rendered_ti_fields_per_task = 0 in the [core] config section might fix the issue.

Of course, it does not work.

Using SHOW ENGINE INNODB STATUS, I see queries like this:

DELETE FROM rendered_task_instance_fields
WHERE rendered_task_instance_fields.dag_id = 'PARTITIONADD'
  AND rendered_task_instance_fields.task_id = 'partition_add'
  AND (rendered_task_instance_fields.dag_id, rendered_task_instance_fields.task_id, rendered_task_instance_fields.execution_date) NOT IN (
        SELECT subq1.dag_id, subq1.task_id, subq1.execution_date
        FROM (
              SELECT rendered_task_instance_fields.dag_id AS dag_id,
                     rendered_task_instance_fields.task_id AS task_id,
                     rendered_task_instance_fields.execution_date AS execution_date
              FROM rendered_task_instance_fields
              WHERE rendered_task_instance_fields.dag_id = 'PARTITIONADD'
                AND rendered_task_instance_fields.task_id = 'partition_add'
              ORDER BY rendered_task_instance_fields.execution_date DESC
              LIMIT 30
        ) AS subq1
  )

-----> Please note LIMIT 30

I found this code inside models/taskinstance.py

 if STORE_SERIALIZED_DAGS:
     RTIF.write(RTIF(ti=self, render_templates=False), session=session)
     RTIF.delete_old_records(self.task_id, self.dag_id, session=session)

and it’s the only place where delete_old_records is called, so it is weird, isn’t it? Where in the universe does that “30” come from?

I’ll investigate further tomorrow…

1 reaction
ozw1z5rd commented, Jul 31, 2020

Setting max_num_rendered_ti_fields_per_task = 0 seems to have fixed my problem. Of course, it can only be a temporary fix. I moved the table cleanup to an external task.
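
For anyone trying the same workaround, the setting goes in the [core] section of airflow.cfg; the excerpt below is only an illustration of the change described above, with the caveats already noted in this thread:

[core]
# Workaround from this thread: with the value at 0, delete_old_records
# returns early and never issues the DELETE on rendered_task_instance_fields.
max_num_rendered_ti_fields_per_task = 0

The same value can also be supplied through Airflow's environment-variable convention, e.g. AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK=0.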


Top Results From Across the Web

MySQL 8.0 Reference Manual :: 15.7.5 Deadlocks in InnoDB
A deadlock is a situation where different transactions are unable to proceed because each holds a lock that the other needs. Because both...
[GitHub] [airflow] jledru opened a new issue #16453
OperationalError) (1213, 'Deadlock found when trying to get lock; ... No pattern found except DAG Serialization, happen from time to time.
MySQL used in Airflow comes to a lot of deadlock
Deadlocks can occur when the same rows are locked in different orders by different transactions. Normalize -- Another way to shrink tables is...
Scheduler — Airflow Documentation - Apache Airflow
If you run a DAG on a schedule_interval of one day, the run with ... are deadlocked, so running with more than a...
Changelog - Apache Airflow Documentation
Use Hash of Serialized DAG to determine DAG is changed or not (#10227) ... [AIRFLOW-2516] Fix mysql deadlocks (#6988).
