MySQL Persistence should retry on Deadlock
Should MySQL operations in withTransaction
be wrapped inside RetryUtil.retryOnException
so that issues like the one below can be retried on the spot rather than bubbling up all the way to WorkflowExecutor?
at org.eclipse.jetty.server.Server.handle(Server.java:524)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.netflix.conductor.core.execution.ApplicationException: BACKEND_ERROR - Deadlock found when trying to get lock; try restarting transaction
at com.netflix.conductor.dao.mysql.MySQLBaseDAO.getWithTransaction(MySQLBaseDAO.java:103)
at com.netflix.conductor.dao.mysql.MySQLBaseDAO.withTransaction(MySQLBaseDAO.java:152)
at com.netflix.conductor.dao.mysql.MySQLExecutionDAO.updateTask(MySQLExecutionDAO.java:137)
at com.netflix.conductor.core.orchestration.ExecutionDAOFacade.updateTask(ExecutionDAOFacade.java:250)
... 51 more
Caused by: com.netflix.conductor.core.execution.ApplicationException: Deadlock found when trying to get lock; try restarting transaction
at com.netflix.conductor.dao.mysql.Query.executeUpdate(Query.java:276)
at com.netflix.conductor.dao.mysql.MySQLExecutionDAO.lambda$addWorkflowToTaskMapping$37(MySQLExecutionDAO.java:584)
at com.netflix.conductor.dao.mysql.MySQLBaseDAO.execute(MySQLBaseDAO.java:197)
at com.netflix.conductor.dao.mysql.MySQLExecutionDAO.addWorkflowToTaskMapping(MySQLExecutionDAO.java:583)
at com.netflix.conductor.dao.mysql.MySQLExecutionDAO.updateTask(MySQLExecutionDAO.java:523)
at com.netflix.conductor.dao.mysql.MySQLExecutionDAO.lambda$updateTask$2(MySQLExecutionDAO.java:137)
at com.netflix.conductor.dao.mysql.MySQLBaseDAO.lambda$withTransaction$3(MySQLBaseDAO.java:153)
at com.netflix.conductor.dao.mysql.MySQLBaseDAO.getWithTransaction(MySQLBaseDAO.java:98)
... 54 more
I think we wrap ES operations in retries but, for some reason, not MySQL ones. The exception above was recorded on v2.3.15. A specific case of deadlock was reported before in https://github.com/Netflix/conductor/issues/576 (where we preferred not to synchronize), but I think deadlocks will still happen occasionally with MySQL persistence, so it's best that transactions are wrapped in retries.
“Always be prepared to re-issue a transaction if it fails due to deadlock. Deadlocks are not dangerous. Just try again.” - per https://dev.mysql.com/doc/refman/5.7/en/innodb-deadlocks-handling.html
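
To make the proposal concrete, here is a minimal sketch of what "retry the transaction on deadlock" could look like. It is not Conductor's code: `DeadlockRetry`, `SqlAction`, `withRetry`, and the backoff parameter are hypothetical names for illustration, and the signature of the existing RetryUtil.retryOnException is deliberately not assumed. It relies only on MySQL reporting a deadlock as vendor error code 1213 with SQLSTATE 40001. The action passed in has to contain the whole transaction, since InnoDB rolls back the victim transaction entirely.

```java
import java.sql.SQLException;

/** Hypothetical helper (not part of Conductor): re-runs a unit of work when MySQL reports a deadlock. */
final class DeadlockRetry {

    /** MySQL vendor error code for ER_LOCK_DEADLOCK. */
    static final int MYSQL_DEADLOCK_CODE = 1213;
    /** SQLSTATE that MySQL reports together with error 1213. */
    static final String DEADLOCK_SQL_STATE = "40001";

    /** A transactional unit of work that may fail with a SQLException. */
    @FunctionalInterface
    interface SqlAction<T> {
        T run() throws SQLException;
    }

    private DeadlockRetry() {}

    /** Walks the cause chain looking for a SQLException that signals a deadlock. */
    static boolean isDeadlock(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SQLException) {
                SQLException sql = (SQLException) t;
                if (sql.getErrorCode() == MYSQL_DEADLOCK_CODE
                        || DEADLOCK_SQL_STATE.equals(sql.getSQLState())) {
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Runs the action, re-running it from scratch when it fails with a deadlock.
     * Any other failure, or a deadlock on the last attempt, is rethrown wrapped
     * in an unchecked exception (an assumption made to keep the sketch small).
     */
    static <T> T withRetry(SqlAction<T> action, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.run();
            } catch (SQLException e) {
                if (!isDeadlock(e) || attempt >= maxAttempts) {
                    throw new IllegalStateException(
                            "Transaction failed after " + attempt + " attempt(s)", e);
                }
                pause(backoffMillis * attempt); // linear backoff before the next attempt
            }
        }
    }

    /** Sleeps for the given time, preserving the interrupt flag if interrupted. */
    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }
}
```

In a DAO this would wrap the body that opens the connection, runs the statements, and commits, for example `DeadlockRetry.withRetry(() -> runInTransaction(...), 3, 50L)`, where `runInTransaction` stands in for whatever the DAO already does. Only deadlocks are retried; other SQL errors surface immediately.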

Sample deadlock captured:
2019-08-19 06:27:20 0x7f979bce7700
*** (1) TRANSACTION:
TRANSACTION 55204, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 22 lock struct(s), heap size 3520, 20 row lock(s), undo log entries 9
MySQL thread id 476, OS thread handle 140289201673984, query id 448134 172.17.0.1 conductor update
INSERT IGNORE INTO workflow_to_task (workflow_id, task_id) VALUES ('9c3e5781-0a7c-41e0-aced-422d0bcf9f59', '4d275114-b7c2-4740-9cdb-31a76758d645')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 38 page no 9748 n bits 168 index PRIMARY of table `conductor`.`workflow_to_task` trx id 55204 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
*** (2) TRANSACTION:
TRANSACTION 55081, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
31 lock struct(s), heap size 8400, 27 row lock(s), undo log entries 14
MySQL thread id 460, OS thread handle 140289130788608, query id 448141 172.17.0.1 conductor update
INSERT INTO task (task_id, json_data, modified_on) VALUES ('d02c997e-4e55-46a6-baf7-eed265211304', '{"taskType":"someTask","status":"SCHEDULED","inputData":{"media_metadata":{"segments":[{"segType":1,"title":"First Segment","startOfMessageHours":0,"startOfMessageMinutes":1,"startOfMessageSeconds":0,"startOfMessageFrames":0,"endOfMessageHours":1,"endOfMessageMinutes":21,"endOfMessageSeconds":34,"endOfMessageFrames":0},{"segType":12,"title":null,"startOfMessageHours":0,"startOfMessageMinutes":1,"startOfMessageSeconds":30,"startOfMessageFrames":0,"endOfMessageHours":1,"endOfMessageMinutes":20,"endOfMessageSeconds":34,"endOfMessageFrames":0}],"identifier":"P262391","title":"Material Title"}},"referenceTaskName":"P262391","retryCount":0,"seq":3,"pollCount":0,"taskDefName":"someTask","scheduledTime":1566196040354,"startTime":0,"endTime":0,"updateTime":0,"startDelayInS
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 38 page no 9748 n bits 168 index PRIMARY of table `conductor`.`workflow_to_task` trx id 55081 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 32 page no 16046 n bits 80 index PRIMARY of table `conductor`.`task` trx id 55081 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
*** WE ROLL BACK TRANSACTION (1)
@s50600822 Here are the guidelines for contributions: https://github.com/Netflix/conductor/blob/master/CONTRIBUTING.md . Thank you.
@s50600822 Definitely makes sense to add retries here. Please feel free to submit a PR when you have a chance.