Detect orchestrations that get into a bad state

Per the bug in this PR, we have gotten into states where an orchestration is scheduled but never actually starts. We would like to have better visibility and recoverability when this happens.

As a workaround for detection, we have created a timer trigger that uses the durableClient.ListInstancesAsync method to find orchestrations with the OrchestrationRuntimeStatus.Pending status that have a CreatedTime that is older than 5 minutes. While I know this could falsely flag orchestrations that were scheduled but never executed because machines were down, it does give us some visibility that something is wrong before a customer complains.
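The detection rule described above can be sketched as a plain filter. This is a minimal Python illustration, not the actual timer-trigger code: `InstanceStatus` and `find_stalled` are hypothetical names, and the list of instances is assumed to have already been fetched (for example, from the durableClient.ListInstancesAsync query mentioned above).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class InstanceStatus:
    """Hypothetical stand-in for the status record returned by an instance query."""
    instance_id: str
    runtime_status: str      # e.g. "Pending", "Running", "Completed"
    created_time: datetime   # timezone-aware UTC timestamp


def find_stalled(
    instances: Iterable[InstanceStatus],
    now: datetime,
    max_pending_age: timedelta = timedelta(minutes=5),
) -> List[InstanceStatus]:
    """Return instances that are still Pending and were created before the cutoff."""
    cutoff = now - max_pending_age
    return [
        inst
        for inst in instances
        if inst.runtime_status == "Pending" and inst.created_time < cutoff
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        InstanceStatus("old-pending", "Pending", now - timedelta(minutes=10)),
        InstanceStatus("new-pending", "Pending", now - timedelta(minutes=1)),
        InstanceStatus("old-running", "Running", now - timedelta(hours=1)),
    ]
    for inst in find_stalled(sample, now):
        print(f"Possibly stalled: {inst.instance_id} (created {inst.created_time})")
```

As noted above, a rule like this cannot tell a genuinely stuck orchestration apart from one that was scheduled while the machines were down, so anything it flags is a candidate for investigation rather than a confirmed failure.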

Further exacerbating the problem, if the stalled orchestration is a singleton, there is no graceful recovery. Terminating and starting the orchestration again doesn’t do anything to get it back into a usable state.

@davidmrdavid

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 9

Top GitHub Comments

davidmrdavid commented, Dec 11, 2020 (1 reaction)

I just discussed this with the rest of the team. Here’s what we’re thinking. In the short term, we think your current approach is reasonable for detecting the problem you’re seeing. Broadly speaking, that approach may catch some false positives but, for the specific issue you’re hitting, it should be enough.

For the longer term, we’re discussing adding a programmatic way of fetching “health statistics” about your Durable Functions storage provider. This would be a more scalable way of detecting “stuck” orchestrators as well as other problems. I suspect this will be a longer term ticket, but it’s a priority for me to increase the observability of Durable Functions so it’ll be on my radar and probably assigned to me.

My current plan is to begin by prioritizing this (https://github.com/Azure/azure-functions-durable-extension/issues/1609) ticket so we can terminate stuck orchestrators in the first place. Once that’s done, it makes sense to me that we’d improve on our ability to detect them. After all, it wouldn’t help us much to be able to detect orchestration failure if there’s no follow-up action we can take.

That’s all in addition to monitoring the hotfix for the specific problem you’re seeing, of course. So if your orchestrators are still getting into bad states, we’ll continue investigating them and providing patches wherever possible.

davidmrdavid commented, Dec 11, 2020 (1 reaction)

Thank you for your patience with us. This ticket took a while to get into the right hands, but we’re on it now. I’ll ping the rest of my team to see if there are any hotfixes, guidance, or longer-term plans we can provide for improving the detection of these bad states.
