Rewind fails to rewind activities that were called with retry

When an activity is called with retry (via ScheduleTask, through CallActivityWithRetryAsync from the Durable Task extension), the RetryInterceptor retries the activity if it fails. Each retry adds a TimerCreated event (and a subsequent TimerFired event) to the history of the orchestration.

(And because of the following code in the retry interceptor:

if (isLastRetry)
{
    // Earlier versions of this retry interceptor had a bug that scheduled an extra delay timer.
    // It's unfortunately not possible to remove the extra timer since that would potentially
    // break the history replay for existing orchestrations. Instead, we do the next best thing
    // and schedule a timer that fires immediately instead of waiting for a full delay interval.
    await this.context.CreateTimer(this.context.CurrentUtcDateTime, "Dummy timer for back-compat");
    break;
}

after the last attempt, another pair of TimerCreated and TimerFired events is added.)
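To make the resulting history concrete, here is a small Python model of the event sequence described above, assuming every attempt fails. It is not the library's C# code: the dictionary layout, the `simulate_retry_history` name, and the `dummy` flag are hypothetical, and the ID numbering (each scheduling event takes the next sequence number, each completion event points back at it) is a simplifying assumption.

```python
def simulate_retry_history(max_attempts):
    """Return a simplified history for an activity that fails on every attempt.

    Assumption: each TaskScheduled / TimerCreated event takes the next
    sequence number, and TaskFailed / TimerFired refer back to that number.
    """
    history = []
    next_id = 0
    for attempt in range(max_attempts):
        task_id = next_id
        next_id += 1
        history.append({"type": "TaskScheduled", "event_id": task_id})
        history.append({"type": "TaskFailed", "task_scheduled_id": task_id})

        # A timer follows every failed attempt. After the last attempt it is
        # the back-compat timer that fires immediately (flagged as "dummy").
        timer_id = next_id
        next_id += 1
        is_last = attempt == max_attempts - 1
        history.append({"type": "TimerCreated", "event_id": timer_id, "dummy": is_last})
        history.append({"type": "TimerFired", "timer_id": timer_id})
    return history


if __name__ == "__main__":
    for event in simulate_retry_history(2):
        print(event)
```

In this model, two failed attempts leave two TimerCreated/TimerFired pairs in the history, the second being the back-compat timer.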

When AzureTableTrackingStore.RewindHistoryAsync is called to rewind the orchestration, only TaskFailed and SubOrchestrationInstanceFailed events (and their corresponding TaskScheduled and SubOrchestrationInstanceCreated events) get their EventType reset to GenericEvent. So when the orchestration restarts, it encounters TimerCreated and TimerFired events that it did not expect, which causes the following error:

Non-Deterministic workflow detected: A previous execution of this orchestration scheduled a
timer task with sequence number 1 but the current replay execution hasn't (yet?) scheduled this
task. Was a change made to the orchestrator code after this instance had already started running?
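The mismatch can be illustrated with a simplified Python model of the rewind behavior described above. This is a sketch, not the actual `RewindHistoryAsync` implementation: the event dictionaries and the `rewind_failed_tasks_only` helper are hypothetical, and sub-orchestration events are left out for brevity.

```python
GENERIC = "GenericEvent"


def rewind_failed_tasks_only(history):
    """Rewrite TaskFailed events and their TaskScheduled events to GenericEvent.

    Timer events are left untouched, mirroring the behavior described above.
    """
    failed_ids = {e["task_scheduled_id"] for e in history if e["type"] == "TaskFailed"}
    rewound = []
    for e in history:
        is_failed = e["type"] == "TaskFailed"
        is_its_schedule = e["type"] == "TaskScheduled" and e["event_id"] in failed_ids
        if is_failed or is_its_schedule:
            rewound.append({**e, "type": GENERIC})
        else:
            rewound.append(dict(e))
    return rewound


# Hypothetical history: one failed attempt followed by its retry timer.
history = [
    {"type": "TaskScheduled", "event_id": 0},
    {"type": "TaskFailed", "task_scheduled_id": 0},
    {"type": "TimerCreated", "event_id": 1},
    {"type": "TimerFired", "timer_id": 1},
]

# The timer events survive the rewind, so the replay finds a timer it never scheduled.
leftover = [e for e in rewind_failed_tasks_only(history) if e["type"].startswith("Timer")]
```

In this model the surviving TimerCreated event carries sequence number 1, which lines up with the sequence number reported in the error message.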

I think that, to fix this, the rewind algorithm should take the timer events into account and also overwrite their EventType to GenericEvent. I’ve tested this by modifying the table storage entries before rewinding, and that works. One way to implement it would be to find all the TimerCreated events that have an EventId higher than that of the TaskScheduled event being reset. The corresponding TimerFired events can be found using the TimerId property.
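A minimal Python sketch of that proposal, using a simplified history model rather than the real table-storage entities, might look as follows. The `rewind_including_timers` helper and the dictionary fields are hypothetical, and using the lowest reset TaskScheduled ID as the threshold is one possible reading of the rule above.

```python
GENERIC = "GenericEvent"


def rewind_including_timers(history):
    """Reset failed tasks plus every timer created after the first failed task."""
    failed_ids = {e["task_scheduled_id"] for e in history if e["type"] == "TaskFailed"}
    if not failed_ids:
        return [dict(e) for e in history]

    # Assumption: timers with an ID above the earliest reset task are retry timers.
    first_failed = min(failed_ids)
    timer_ids = {
        e["event_id"]
        for e in history
        if e["type"] == "TimerCreated" and e["event_id"] > first_failed
    }

    def should_reset(e):
        kind = e["type"]
        return (
            kind == "TaskFailed"
            or (kind == "TaskScheduled" and e["event_id"] in failed_ids)
            or (kind == "TimerCreated" and e["event_id"] in timer_ids)
            # TimerFired is matched to its TimerCreated through the timer ID.
            or (kind == "TimerFired" and e["timer_id"] in timer_ids)
        )

    return [{**e, "type": GENERIC} if should_reset(e) else dict(e) for e in history]


# Hypothetical history: a user timer, then a failed activity with its retry timer.
history = [
    {"type": "TimerCreated", "event_id": 0},
    {"type": "TimerFired", "timer_id": 0},
    {"type": "TaskScheduled", "event_id": 1},
    {"type": "TaskFailed", "task_scheduled_id": 1},
    {"type": "TimerCreated", "event_id": 2},
    {"type": "TimerFired", "timer_id": 2},
]
rewound = rewind_including_timers(history)
```

Here the timer created before the failed task is preserved, while the retry timer after it is reset. A timer that the orchestrator itself created after the failed task would be reset as well under this rule.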

I don’t mind implementing the fix for this, but I would like to know whether this is the best approach. The change could inadvertently reset some timers it should not touch. But since the rewind algorithm already resets all the TaskFailed events, resetting the timer events that come after them might work fine.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 8

Top GitHub Comments

1 reaction
cgillum commented, Mar 3, 2023

Adding @lilyjma, who’s helping manage our backlog.

We accept pull requests. However, one of our goals for improving this feature is to rewrite it so that it’s simpler and works for all backend types (Azure Storage, Netherite, MSSQL, etc.). The current implementation only works for Azure Storage. There is a brief proposal here if you’re interested in taking a look and potentially contributing: https://github.com/Azure/durabletask/issues/731.

1 reaction
cgillum commented, Mar 3, 2023

No updates. This item unfortunately hasn’t made it high enough in the team’s backlog.
