question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Durable function fails with `TaskCanceledException` and never gets retried

See original GitHub issue

We have recently upgraded from Net6 In-Process to Net7 Isolated functions and since this upgrade we are seeing intermittent TaskCanceledException across multiple functions doing a variety of different work so no common/consistent reason we can find. It does not appear to ever hit our code and the exception is happening before the function is invoked within the WorkerFunctionInvoker.

One example below but we can provide more:

  • Timestamp: 2023-03-28T09:20:19.6220000Z
  • Function App version: v4
  • Function App name: -
  • Function name(s) (as appropriate): -
  • Invocation ID: c7ea2e1e-1e12-4d0b-ab60-fe3b75a92467
  • Region: Central US

Additional Invocation Ids from last 24 hours on a sandbox we are running: a0a7c75d-e997-4361-aa3b-a4c5501d0ec7 c6b083bf-60ab-4cb1-9ab6-8ca5f883c388

Stack

System.Threading.Tasks.TaskCanceledException:
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Script.Description.WorkerFunctionInvoker+<InvokeCore>d__9.MoveNext (Microsoft.Azure.WebJobs.Script, Version=4.16.0.0, Culture=neutral, PublicKeyToken=null: /_/src/WebJobs.Script/Description/Workers/WorkerFunctionInvoker.cs:96)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Script.Description.FunctionInvokerBase+<Invoke>d__24.MoveNext (Microsoft.Azure.WebJobs.Script, Version=4.16.0.0, Culture=neutral, PublicKeyToken=null: /_/src/WebJobs.Script/Description/FunctionInvokerBase.cs:82)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Script.Description.FunctionGenerator+<Coerce>d__3`1.MoveNext (Microsoft.Azure.WebJobs.Script, Version=4.16.0.0, Culture=neutral, PublicKeyToken=null: /_/src/WebJobs.Script/Description/FunctionGenerator.cs:225)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2+<InvokeAsync>d__10.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:52)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.OutOfProcMiddleware+<>c__DisplayClass10_0+<<CallOrchestratorAsync>b__0>d.MoveNext (Microsoft.Azure.WebJobs.Extensions.DurableTask, Version=2.0.0.0, Culture=neutral, PublicKeyToken=014045d636e89289: D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\OutOfProcMiddleware.cs:124)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.TriggeredFunctionExecutor`1+<>c__DisplayClass7_0+<<TryExecuteAsync>b__0>d.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\TriggeredFunctionExecutor.cs:50)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<InvokeWithTimeoutAsync>d__33.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:581)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithWatchersAsync>d__32.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:527)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:306)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<ExecuteWithLoggingAsync>d__26.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:352)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor+<TryExecuteAsync>d__18.MoveNext (Microsoft.Azure.WebJobs.Host, Version=3.0.36.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35: C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:108)

Repro steps

We cannot reproduce this consistently, the same function will work without issue 100 times then for no reason we will get a TaskCancelledException.

Issue Analytics

  • State:closed
  • Created 6 months ago
  • Reactions:1
  • Comments:21 (10 by maintainers)

github_iconTop GitHub Comments

3reactions
apringlecommented, Apr 25, 2023

Hi, we are upgrading two large durable function projects (5k+ orchestrations,150k+ function executions total daily) from in-process 3.1 to isolated 7.

We believe we are hitting this issue in our dev environments while testing and are investigating further. Happy to provide further info if needed. Thanks

3reactions
RichardBurns1982commented, Apr 25, 2023

(Still haven’t received the support case so will respond here for now)

Why is the host shutting down:

Scheduled shut down of worker instance or scaling; this leads to host shut down and potential cancellation of invocations (these intentionally by design never reach worker or user code). This is normal behaviour, there’s nothing out of place here.

Why are we “losing orchestration functions mid flow and the retry logic not isn’t working”:

TL;DR: Because DF manages ongoing invocations independently of the function host, and so when the host cancels an invocation and fails it (without sending it to the worker), it considers the invocation complete and does not retry it (even though that function invocation never actually executed).

  • The orchestrator function get trigger and executes, as part of that an activity function gets invoked.
  • We then get a shut down request and so we wait for this activity function to finish.
  • We wait for the activity invocation but the orchestrator function itself doesn’t have an active invocation so we don’t consider it active; durable functions manages its own notion of “ongoing invocation” that is separate from the host.
  • When the activity function finishes, we flag all invocations as complete and continue to shut down the host fully.
  • Then, the orchestrator gets triggered again (because the activity finished), however this invocation request happens after we have begun full shut down so it gets rejected - we cancel and don’t invoke worker or user code.

Screenshot 2023-04-24 at 1 52 30 PM

Next steps:

We are going to try to repro this issue and work with the durable functions team to determine if this is a bug on their side or if we need to design some host changes to help durable handle cancellations so they can retry these invocations later on a different host instance than the one that is shutting down.

Thanks for the detailed response!

For most of our durable functions this isn’t a problem as we are primarily using them to work down a persisted queue or generate denormalized data so we run them again without issue, there are a few exceptions where this isn’t true and causing us a problem but we can work around in the short term.

Regarding the TaskCancelledException my only concern is that these errors are appearing in AppInsights which is what led to us investigating these issues.

The two scenarios that are affecting us are:

  1. When we get a TaskCancelledException on a standard function or on the Durable Orchestration entry. These we can’t recover from as it never hits our code but we are seeing the errors in AppInsights, sounds like this isn’t an issue for standard functions as while we are seeing the error in AppInsights it sounds like they are retried by the host and do run. Not the same for Durable where the instance entry in table storage is being failed.
  2. When we get a TaskCancelledException on an activity, this is bubbling up to the Orchestration function and we can try to handle this error and retry the Activity but that would be a lot of manual code in every Orchestration function we have.

Ideally, if the TaskCancelledException is happening internally in the host code we’d prefer never to see it or have to handle it as when it occurs and is logged in AppInsights we investigate, triage, etc. I’d be interested to know if you think we should be factoring this exception scenario in and specifically coding for it if it happens on an Activity or do you think this is an error we should never be seeing in AppInsights or handling ourselves.

Many thanks

Richard

Read more comments on GitHub >

github_iconTop Results From Across the Web

Handling errors in Durable Functions (Azure Functions)
Any exception that is thrown in an activity function is marshaled back to the orchestrator function and thrown as a FunctionFailedException .
Read more >
Durable Functions: Get failed tasks in Fan out pattern
I'm using the fan-out pattern to execute a list of tasks that have a possibility of failing. So I'm using the retry strategy...
Read more >
Ways to resolve failures in durable functions
Say the function failed. It can be in any place: Debit, Credit or Refund or somewhere in between, or process crashed or code...
Read more >
Azure Durable Functions System.Threading.Tasks. ...
These are first-chance exceptions which come from the Azure Functions runtime and are caught by the Azure Functions runtime. I agree they are ......
Read more >
Retry activity without throwing exception · Issue #1185
I would like to be able to retry these with CallActivityWi. ... exception to cause the function to fail and for the orchestrator...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found