Rewind fails to rewind activities that were called with retry
When calling an activity with retry (via ScheduleTask via CallActivityWithRetryAsync, from the durable task extension), the RetryInterceptor retries the activity if it fails. For each retry, a TimerCreated event (and a subsequent TimerFired event) is added to the history of the orchestration.
(And due to the following:
    if (isLastRetry)
    {
        // Earlier versions of this retry interceptor had a bug that scheduled an extra delay timer.
        // It's unfortunately not possible to remove the extra timer since that would potentially
        // break the history replay for existing orchestrations. Instead, we do the next best thing
        // and schedule a timer that fires immediately instead of waiting for a full delay interval.
        await this.context.CreateTimer(this.context.CurrentUtcDateTime, "Dummy timer for back-compat");
        break;
    }
after the last attempt, another TimerCreated and TimerFired event pair is added.)
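To make the resulting history concrete, here is a minimal sketch that simulates the retry loop using plain strings instead of real history events. RetryHistorySketch and Simulate are hypothetical names, not DurableTask APIs. The loop is an assumed simplification of what RetryInterceptor does when every attempt fails, and it ignores all other event types that a real history would contain.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulation, not the DurableTask implementation: it only records
// the event types that would be appended when every attempt fails.
public static class RetryHistorySketch
{
    public static List<string> Simulate(int maxNumberOfAttempts)
    {
        var history = new List<string>();
        for (int attempt = 0; attempt < maxNumberOfAttempts; attempt++)
        {
            // Each attempt schedules the activity, which then fails.
            history.Add("TaskScheduled");
            history.Add("TaskFailed");

            // A timer follows every failed attempt. After the last attempt it is
            // the back-compat dummy timer that fires immediately.
            history.Add("TimerCreated");
            history.Add("TimerFired");
        }
        return history;
    }
}
```

Under these assumptions, three failing attempts leave three TimerCreated and three TimerFired entries in the history, next to the three TaskScheduled and TaskFailed pairs.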
When AzureTableTrackingStore.RewindHistoryAsync is called to rewind the orchestration, only TaskFailed and SubOrchestrationInstanceFailed events (and their corresponding TaskScheduled and SubOrchestrationInstanceCreated events) get their EventType reset to GenericEvent. So when the orchestration restarts, it encounters TimerCreated and TimerFired events that it did not expect, which causes the following error:
Non-Deterministic workflow detected: A previous execution of this orchestration scheduled a
timer task with sequence number 1 but the current replay execution hasn't (yet?) scheduled this
task. Was a change made to the orchestrator code after this instance had already started running?
I think that to fix this, the rewind algorithm should take the timer events into account and also overwrite their EventType to GenericEvent. I’ve tested this by modifying the table storage entries before rewinding, and that works. I can imagine that the fix is to find all the TimerCreated events that have an EventId higher than that of the TaskScheduled event being reset. The corresponding TimerFired events can be found using the TimerId property.
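As a rough illustration of that idea, the sketch below works on a simplified in-memory model of the history rows. HistoryRow, RewindTimerSketch, and ResetTimersAfter are hypothetical names, not part of AzureTableTrackingStore. It also assumes that a TimerFired row's TimerId holds the EventId of its TimerCreated row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a history row; not the real table entity.
public sealed class HistoryRow
{
    public string EventType { get; set; } = "";
    public int EventId { get; set; }   // sequence number of scheduled/created events
    public int? TimerId { get; set; }  // assumed: set on TimerFired, refers to TimerCreated.EventId
}

public static class RewindTimerSketch
{
    // Overwrites the EventType of every TimerCreated row whose EventId is higher than
    // the reset TaskScheduled row, and of the TimerFired rows that reference those timers.
    // Returns the number of rows that were changed.
    public static int ResetTimersAfter(IList<HistoryRow> history, int resetTaskScheduledEventId)
    {
        var timerIds = new HashSet<int>(
            history.Where(r => r.EventType == "TimerCreated" && r.EventId > resetTaskScheduledEventId)
                   .Select(r => r.EventId));

        int changed = 0;
        foreach (HistoryRow row in history)
        {
            bool isCreated = row.EventType == "TimerCreated" && timerIds.Contains(row.EventId);
            bool isFired = row.EventType == "TimerFired"
                && row.TimerId.HasValue
                && timerIds.Contains(row.TimerId.Value);

            if (isCreated || isFired)
            {
                row.EventType = "GenericEvent";
                changed++;
            }
        }
        return changed;
    }
}
```

Timers created before the reset TaskScheduled event keep their EventType in this sketch, which is the property the proposal relies on. Whether timers created later by unrelated orchestrator code should also be reset is exactly the open question raised in this issue.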
I don’t mind implementing the fix for this, but I would like to know if this is the best approach. I can imagine that this change could inadvertently reset some timers it should not touch. But since the rewind algorithm just resets all the TaskFailed events, resetting the timer events that follow those events might work fine.
Top GitHub Comments
Adding @lilyjma, who’s helping manage our backlog.
We accept pull requests. However, one of our goals for improving this feature is to rewrite it so that it’s simpler and works for all backend types (Azure Storage, Netherite, MSSQL, etc.). The current implementation only works for Azure Storage. There is a brief proposal here if you’re interested in taking a look and potentially contributing: https://github.com/Azure/durabletask/issues/731.
No updates. This item unfortunately hasn’t made it high enough in the team’s backlog.