Detect orchestrations that get into a bad state
Per the bug in this PR, we have gotten into states where an orchestration is scheduled but never actually starts. We would like better visibility and recoverability when this happens.
As a workaround for detection, we created a timer trigger that uses the durableClient.ListInstancesAsync method to find orchestrations in the OrchestrationRuntimeStatus.Pending status whose CreatedTime is more than 5 minutes old. This can falsely flag orchestrations that were scheduled but never executed because machines were down, but it at least gives us visibility that something is wrong before a customer complains.
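For illustration, here is a minimal sketch of what such a detection function might look like against the Durable Functions 2.x .NET API. The function name, CRON schedule, page size, and the 5-minute threshold are placeholders; the query condition mirrors the approach described above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class StuckOrchestrationDetector
{
    // Timer trigger that runs every 5 minutes and flags orchestrations
    // that have sat in the Pending state for longer than 5 minutes.
    [FunctionName("DetectStuckOrchestrations")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        var condition = new OrchestrationStatusQueryCondition
        {
            RuntimeStatus = new[] { OrchestrationRuntimeStatus.Pending },
            CreatedTimeTo = DateTime.UtcNow.AddMinutes(-5), // created more than 5 minutes ago
            PageSize = 100,
        };

        do
        {
            OrchestrationStatusQueryResult result =
                await client.ListInstancesAsync(condition, CancellationToken.None);

            foreach (DurableOrchestrationStatus status in result.DurableOrchestrationStatus)
            {
                // Surface the stalled instance; what to do about it is a
                // separate question (see the discussion below).
                log.LogWarning(
                    "Orchestration {InstanceId} has been Pending since {CreatedTime:u}.",
                    status.InstanceId,
                    status.CreatedTime);
            }

            condition.ContinuationToken = result.ContinuationToken;
        }
        while (condition.ContinuationToken != null);
    }
}
```

Note that this query will also return instances that were only just scheduled if the CreatedTimeTo cutoff is too generous, which is the false-positive caveat mentioned above.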
Further exacerbating the problem, if the stalled orchestration is a singleton, there is no graceful recovery. Terminating and starting the orchestration again doesn’t do anything to get it back into a usable state.
Top GitHub Comments
I just discussed this with the rest of the team. Here’s what we’re thinking. In the short term, we think your current approach is reasonable for detecting the problem you’re seeing. Broadly speaking, that approach may catch some false positives but, for the specific issue you’re hitting, it should be enough.
For the longer term, we’re discussing adding a programmatic way of fetching “health statistics” about your Durable Functions storage provider. This would be a more scalable way of detecting “stuck” orchestrators as well as other problems. I suspect this will be a longer term ticket, but it’s a priority for me to increase the observability of Durable Functions so it’ll be on my radar and probably assigned to me.
My current plan is to begin by prioritizing this (https://github.com/Azure/azure-functions-durable-extension/issues/1609) ticket so we can terminate stuck orchestrators in the first place. Once that’s done, it makes sense to me that we’d improve on our ability to detect them. After all, it wouldn’t help us much to be able to detect orchestration failure if there’s no follow-up action we can take.
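To make that follow-up action concrete, here is a hedged sketch of what a recovery step could look like once the termination fix in issue #1609 lands. The helper name and the orchestrator name are placeholders; it reuses the same client types as the detector sketch above.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class StuckOrchestrationRecovery
{
    // Hypothetical follow-up action: terminate the stuck instance and, if it
    // is a singleton, restart it under the same instance ID. This only becomes
    // a real recovery path once termination of stuck instances works (#1609).
    public static async Task RecoverSingletonAsync(
        IDurableOrchestrationClient client,
        DurableOrchestrationStatus status)
    {
        await client.TerminateAsync(status.InstanceId, "Stuck in Pending state");

        // "MySingletonOrchestrator" is a placeholder for the real orchestrator name.
        await client.StartNewAsync("MySingletonOrchestrator", status.InstanceId);
    }
}
```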
That’s all in addition to monitoring the hotfix for the specific problem you’re seeing, of course. So if your orchestrators are still getting into bad states, we’ll continue investigating them and providing patches wherever possible.
Thank you for your patience with us; this ticket took a bit to get into the right hands, but we’re on it now. I’ll ping the rest of my team to see if there are any hotfixes, guidance, or longer-term plans we can provide for improving the detection of these bad states.