Detect orchestrations that get into a bad state
Per the bug in this PR, we have gotten into states where an orchestration is scheduled but never actually starts. We would like better visibility and recoverability when this happens.
As a workaround for detection, we created a timer trigger that uses the durableClient.ListInstancesAsync method to find orchestrations in the OrchestrationRuntimeStatus.Pending status whose CreatedTime is more than 5 minutes old. This can falsely flag orchestrations that were scheduled but never executed because machines were down, but it at least gives us visibility that something is wrong before a customer complains.
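For illustration, here is a minimal sketch of what such a detection function might look like against the Durable Functions 2.x .NET API. The function name, CRON schedule, page size, and the 5-minute threshold are placeholders; the query condition mirrors the approach described above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class StuckOrchestrationDetector
{
    // Timer trigger that runs every 5 minutes and flags orchestrations
    // that have sat in the Pending state for longer than 5 minutes.
    [FunctionName("DetectStuckOrchestrations")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        var condition = new OrchestrationStatusQueryCondition
        {
            RuntimeStatus = new[] { OrchestrationRuntimeStatus.Pending },
            CreatedTimeTo = DateTime.UtcNow.AddMinutes(-5), // created more than 5 minutes ago
            PageSize = 100,
        };

        do
        {
            OrchestrationStatusQueryResult result =
                await client.ListInstancesAsync(condition, CancellationToken.None);

            foreach (DurableOrchestrationStatus status in result.DurableOrchestrationStatus)
            {
                // Surface the stalled instance; what to do about it is a
                // separate question (see the discussion below).
                log.LogWarning(
                    "Orchestration {InstanceId} has been Pending since {CreatedTime:u}.",
                    status.InstanceId,
                    status.CreatedTime);
            }

            condition.ContinuationToken = result.ContinuationToken;
        }
        while (condition.ContinuationToken != null);
    }
}
```

Note that this query will also return instances that were only just scheduled if the CreatedTimeTo cutoff is too generous, which is the false-positive caveat mentioned above.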
Further exacerbating the problem, if the stalled orchestration is a singleton, there is no graceful recovery. Terminating and starting the orchestration again doesn’t do anything to get it back into a usable state.
Top GitHub Comments
I just discussed this with the rest of the team. Here’s what we’re thinking. In the short term, we think your current approach is reasonable for detecting the problem you’re seeing. Broadly speaking, that approach may catch some false positives but, for the specific issue you’re hitting, it should be enough.
For the longer term, we’re discussing adding a programmatic way of fetching “health statistics” about your Durable Functions storage provider. This would be a more scalable way of detecting “stuck” orchestrators as well as other problems. I suspect this will be a longer term ticket, but it’s a priority for me to increase the observability of Durable Functions so it’ll be on my radar and probably assigned to me.
My current plan is to begin by prioritizing this (https://github.com/Azure/azure-functions-durable-extension/issues/1609) ticket so we can terminate stuck orchestrators in the first place. Once that’s done, it makes sense to me that we’d improve on our ability to detect them. After all, it wouldn’t help us much to be able to detect orchestration failure if there’s no follow-up action we can take.
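To make that follow-up action concrete, here is a hedged sketch of what a recovery step could look like once the termination fix in issue #1609 lands. The helper name and the orchestrator name are placeholders; it reuses the same client types as the detector sketch above.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class StuckOrchestrationRecovery
{
    // Hypothetical follow-up action: terminate the stuck instance and, if it
    // is a singleton, restart it under the same instance ID. This only becomes
    // a real recovery path once termination of stuck instances works (#1609).
    public static async Task RecoverSingletonAsync(
        IDurableOrchestrationClient client,
        DurableOrchestrationStatus status)
    {
        await client.TerminateAsync(status.InstanceId, "Stuck in Pending state");

        // "MySingletonOrchestrator" is a placeholder for the real orchestrator name.
        await client.StartNewAsync("MySingletonOrchestrator", status.InstanceId);
    }
}
```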
That’s all in addition to monitoring the hotfix for the specific problem you’re seeing, of course. So if your orchestrators are still getting into bad states, we’ll continue investigating them and providing patches wherever possible.
Thank you for your patience with us; this ticket took a bit to get into the right hands, but we’re on it now. I’ll ping the rest of my team to see if there are any hotfixes, guidance, or longer-term plans we can provide for improving the detection of these bad states.