Rewind fails to rewind activities that were called with retry

When an activity is called with retry (through ScheduleTask, reached via CallActivityWithRetryAsync from the Durable Task extension), the RetryInterceptor retries the activity if it fails. For each retry, a TimerCreated event (and a subsequent TimerFired event) is added to the history of the orchestration.

(And due to the following:

if (isLastRetry)
{
    // Earlier versions of this retry interceptor had a bug that scheduled an extra delay timer.
    // It's unfortunately not possible to remove the extra timer since that would potentially
    // break the history replay for existing orchestrations. Instead, we do the next best thing
    // and schedule a timer that fires immediately instead of waiting for a full delay interval.
    await this.context.CreateTimer(this.context.CurrentUtcDateTime, "Dummy timer for back-compat");
    break;
}

after the last attempt, one more TimerCreated event and its TimerFired event are added).
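
To make the event sequence concrete, here is a minimal Python model of the history that an always-failing activity would produce under the retry behaviour described above. It is a sketch, not DurableTask code: the `Event` record and `failing_activity_history` are hypothetical names, and the numbering (scheduling events take the next sequence number, completion events carry -1 and point back at their source) is an assumption inferred from the sequence number in the error message quoted in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    event_type: str                   # e.g. "TaskScheduled", "TimerCreated"
    event_id: int                     # sequence number, or -1 for completion events
    source_id: Optional[int] = None   # TaskScheduledId / TimerId of the completed event


def failing_activity_history(max_attempts: int) -> List[Event]:
    """History of one activity whose attempts all fail (hypothetical model)."""
    history: List[Event] = []
    next_id = 0
    for _ in range(max_attempts):
        # Each attempt schedules the activity and then fails.
        history.append(Event("TaskScheduled", next_id))
        history.append(Event("TaskFailed", -1, source_id=next_id))
        next_id += 1
        # A timer follows every attempt: a real delay between retries, and the
        # immediate back-compat timer after the last attempt.
        history.append(Event("TimerCreated", next_id))
        history.append(Event("TimerFired", -1, source_id=next_id))
        next_id += 1
    return history


history = failing_activity_history(max_attempts=3)
timer_ids = [e.event_id for e in history if e.event_type == "TimerCreated"]
print(timer_ids)
```

In this model, three attempts leave three timer pairs in the history, and the first timer carries sequence number 1.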

When AzureTableTrackingStore.RewindHistoryAsync is called to rewind the orchestration, only the TaskFailed and SubOrchestrationInstanceFailed events (and their corresponding TaskScheduled and SubOrchestrationInstanceCreated events) get their EventType reset to GenericEvent. So when the orchestration restarts, it encounters TimerCreated and TimerFired events that it did not expect, which causes the following error:

Non-Deterministic workflow detected: A previous execution of this orchestration scheduled a
timer task with sequence number 1 but the current replay execution hasn't (yet?) scheduled this 
task. Was a change made to the orchestrator code after this instance had already started running?
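
The mismatch can be reproduced on paper with a small model. The following Python sketch applies the reset rule described above to a hand-written history and lists the events a replay would still see. The dictionaries, `FAILED_TO_SCHEDULED`, and `rewind_as_described` are illustrative names and assumed values, not the `AzureTableTrackingStore` implementation.

```python
# Completion event type -> the scheduling event type it pairs with.
FAILED_TO_SCHEDULED = {
    "TaskFailed": "TaskScheduled",
    "SubOrchestrationInstanceFailed": "SubOrchestrationInstanceCreated",
}


def rewind_as_described(history):
    """Reset failed events and their scheduling events to GenericEvent.

    Returns a new list. Timer events are deliberately left untouched, which
    mirrors the behaviour reported in this issue.
    """
    reset_keys = set()
    for event in history:
        if event["type"] in FAILED_TO_SCHEDULED:
            reset_keys.add((FAILED_TO_SCHEDULED[event["type"]], event["source_id"]))

    rewound = []
    for event in history:
        is_failed = event["type"] in FAILED_TO_SCHEDULED
        is_its_schedule = (event["type"], event["id"]) in reset_keys
        new_type = "GenericEvent" if is_failed or is_its_schedule else event["type"]
        rewound.append({**event, "type": new_type})
    return rewound


# One failed attempt followed by its retry timer (ids are assumed values).
history = [
    {"type": "TaskScheduled", "id": 0, "source_id": None},
    {"type": "TaskFailed", "id": -1, "source_id": 0},
    {"type": "TimerCreated", "id": 1, "source_id": None},
    {"type": "TimerFired", "id": -1, "source_id": 1},
]

leftover = [e["type"] for e in rewind_as_described(history) if e["type"] != "GenericEvent"]
print(leftover)
```

The only events that keep their type are the timer pair, so a replay that starts by scheduling the activity again meets a timer with sequence number 1 that it has not created yet.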

I think to fix this, the rewind algorithm should take the timer events into account, and also overwrite their EventType to GenericEvent. I’ve tested this by modifying the table storage entries before rewinding and that works. I can imagine that the fix is to find all the TimerCreated events that have an EventId higher than the TaskScheduled that is being reset. The corresponding TimerFired events can be found using the TimerId property.
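
The proposal can be sketched as a selection rule over the history. This Python model only illustrates the idea in the paragraph above and is not a patch for `RewindHistoryAsync`: the `Event` tuple and `timers_to_reset` are hypothetical, and using the lowest reset TaskScheduled id as the cut-off is an assumption of this sketch.

```python
from typing import List, NamedTuple, Optional, Set


class Event(NamedTuple):
    type: str
    event_id: int                    # -1 for completion events in this model
    source_id: Optional[int] = None  # TaskScheduledId or TimerId


def timers_to_reset(history: List[Event]) -> Set[int]:
    """Return indexes of timer events the proposed rewind would also reset."""
    failed_task_ids = [e.source_id for e in history if e.type == "TaskFailed"]
    if not failed_task_ids:
        return set()
    cutoff = min(failed_task_ids)  # assumption: earliest TaskScheduled being reset

    # TimerCreated events scheduled after the task that is being reset.
    timer_ids = {
        e.event_id for e in history
        if e.type == "TimerCreated" and e.event_id > cutoff
    }
    # Add the matching TimerFired events, linked through their TimerId.
    return {
        i for i, e in enumerate(history)
        if (e.type == "TimerCreated" and e.event_id in timer_ids)
        or (e.type == "TimerFired" and e.source_id in timer_ids)
    }


history = [
    Event("TimerCreated", 0),              # unrelated timer before the activity
    Event("TimerFired", -1, source_id=0),
    Event("TaskScheduled", 1),
    Event("TaskFailed", -1, source_id=1),
    Event("TimerCreated", 2),              # retry timer
    Event("TimerFired", -1, source_id=2),
]
print(sorted(timers_to_reset(history)))
```

Here the timer that fired before the activity keeps its type, while the retry timer pair is selected. Any other timer with a higher EventId than the reset task would be selected as well, which is the side effect to weigh before adopting this rule.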

I don’t mind implementing the fix for this, but I would like to know if this is the best approach. I can imagine that this change can inadvertently reset some timers it should not touch. But as the Rewind algorithm just resets all the TaskFailed events, resetting the timer events after those events might just work fine.

Top GitHub Comments

cgillum commented, Mar 3, 2023

Adding @lilyjma, who’s helping manage our backlog.

We accept pull requests. However, one of our goals for improving this feature is to rewrite it so that it’s simpler and works for all backend types (Azure Storage, Netherite, MSSQL, etc.). The current implementation only works for Azure Storage. There is a brief proposal here if you’re interested in taking a look and potentially contributing: https://github.com/Azure/durabletask/issues/731.

cgillum commented, Mar 3, 2023

No updates. This item unfortunately hasn’t made it high enough in the team’s backlog.
