[Singleton Pattern] Instances are going to Pending and some remain in Running State.

I need to process messages with the same PersonId one at a time (singleton fashion), while messages with different PersonIds should run in parallel. To do this, I maintain an ordered list per PersonId in a Redis cache; for example, mycache:PersonId:1 contains all messages for PersonId 1. If an instance with the same PersonId is already running, the Azure queue message is ignored.

Testing went fine locally using Azure Functions Core Tools v2. When deployed to Azure, some instances sometimes go to the Pending state and some remain in Running forever. Sometimes all instances go to the Pending state.

Why is this happening on Azure?

Below is a simplified sample with the same code structure:

[FunctionName("PersonFunction_QueueStart")]
public static async Task QueueStart(
    [QueueTrigger("person-queue", Connection = "connection-string")] PersonMessage personMessage,
    [OrchestrationClient] DurableOrchestrationClient starter,
    ILogger log)
{
    //log.LogInformation(..); log some attributes

    string instanceId = personMessage.PersonId;

    log.LogInformation($"Checking if instance with instance ID {instanceId} already exists.");
    var instance = await starter.GetStatusAsync(instanceId);
    if (instance == null ||
        instance.RuntimeStatus == OrchestrationRuntimeStatus.Completed ||
        instance.RuntimeStatus == OrchestrationRuntimeStatus.Failed)
    {
        log.LogInformation($"PersonFunction instance with instance ID {instanceId} does not exist.");
        await starter.StartNewAsync("PersonFunction", instanceId, personMessage);
        log.LogInformation($"Started orchestration with ID = {instanceId}.");
    }
    else
    {
        log.LogInformation($"PersonFunction instance with Instance ID {instanceId} already exists.");
    }
}


[FunctionName("PersonFunction")]
public static async Task RunOrchestrator(
    [OrchestrationTrigger] DurableOrchestrationContext context, ILogger log)
{
    log.LogInformation($"Executing PersonFunction orchestration with instance id {context.InstanceId}.");

    var input = context.GetInput<PersonMessage>();

    // Retrieve message from cache
    var personMessage = await context.CallActivityAsync<PersonMessage>(
        "PersonFunction_RetrieveMessage", input.PersonId);

    if (personMessage == null)
    {
        // Person_Monitor inserts a message in queue of Monitor durable function
        // which reads the cache after some seconds to see if any message is left.
        // If any message in cache is left it inserts a message in person-queue.
        await context.CallActivityAsync("Person_Monitor", input);
    }

    else
    {
        // Execute some stored procedure
        var result = await context.CallActivityAsync<PersonMessageResult>("PersonFunction_ExecuteProcedure",
            personMessage);

        // Left-pop the processed message from Redis
        await context.CallActivityAsync<PersonMessage>(
            "PersonFunction_LeftPopMessage", input.PersonId);

        context.ContinueAsNew(input);
    }

    log.LogInformation($"PersonFunction orchestration with instance id {context.InstanceId} executed successfully.");
}

NuGet packages:

Microsoft.Azure.WebJobs.Extensions.Storage 3.0.3
Microsoft.NET.Sdk.Functions 1.0.24
Microsoft.NETCore.App 2.2.0
Newtonsoft.Json 11.0.2
StackExchange.Redis 2.0.519
System.Data.SqlClient 4.6.0

cgillum commented, Feb 21, 2019

Thanks @tehmas. I found your orchestration and can confirm that it has gotten into a bad state. I see that you’re using the ContinueAsNew pattern, but I’m also detecting oddities. For example, it looks like multiple singletons are being created with the same name at around the same time. One such time is 2019-02-21 07:05:14.8686352, when three instances of your singleton were created concurrently.

It seems to me that you’re running into this issue: https://github.com/Azure/azure-functions-durable-extension/issues/612

Looking at your PersonFunction_QueueStart method, there is a race condition where two queue messages processed at the same time could cause two singletons to be created at the same time. This appears to be the source of the corruption. To fix it, you’ll need to use some form of locking (for example, a blob lease or the [Singleton] WebJobs attribute) to prevent this from happening. In the meantime, we’re looking into ways to make the StartNewAsync API safe for multi-threaded use.
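
The check-then-start race described above, and the effect of serializing it per instance ID, can be sketched in isolation. This is a minimal in-memory simulation, not Durable Functions code: `SingletonStarter`, `TryStartAsync`, and the dictionaries are hypothetical stand-ins for the `GetStatusAsync`/`StartNewAsync` pair and the status store. The `SemaphoreSlim` used here only protects a single process; in a deployed function app that may scale out across hosts, the lock would need to be distributed, such as the blob lease or [Singleton] attribute mentioned above.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for the orchestration status store.
public class SingletonStarter
{
    private readonly ConcurrentDictionary<string, bool> _running =
        new ConcurrentDictionary<string, bool>();
    private readonly ConcurrentDictionary<string, int> _startCounts =
        new ConcurrentDictionary<string, int>();
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _locks =
        new ConcurrentDictionary<string, SemaphoreSlim>();

    // Number of times an instance with this ID was started.
    public int StartCount(string instanceId) =>
        _startCounts.TryGetValue(instanceId, out var n) ? n : 0;

    // Mirrors the "get status, then start if absent" logic of the queue trigger.
    public async Task<bool> TryStartAsync(string instanceId, bool useLock)
    {
        // One lock per instance ID, so different IDs never block each other.
        SemaphoreSlim gate = _locks.GetOrAdd(instanceId, _ => new SemaphoreSlim(1, 1));
        if (useLock) await gate.WaitAsync();
        try
        {
            // Check: is an instance with this ID already running?
            if (_running.ContainsKey(instanceId)) return false;

            // Simulated latency between the status check and the start call.
            await Task.Delay(10);

            // Start: without the lock, several callers can reach this point.
            _running[instanceId] = true;
            _startCounts.AddOrUpdate(instanceId, 1, (_, count) => count + 1);
            return true;
        }
        finally
        {
            if (useLock) gate.Release();
        }
    }
}
```

In this simulation, three concurrent calls to `TryStartAsync("1", useLock: false)` start the instance more than once, which resembles the concurrent singleton creations seen in the logs. With `useLock: true`, only one of the calls starts it, and calls for a different ID proceed independently.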

cgillum commented, Feb 26, 2019

Unfortunately I don’t have any insights or expertise on Redis, so I can’t comment on how that might be impacting your function app.

By “delay” do you mean the time taken by the execution of the function or the time taken for the function to start its execution?

I meant that starting at 14:17 I see a large gap in activity, followed by new activity in a new (recycled) process. I broke it down further so you can see in more detail:

At around 2019-02-25 14:13:11.9229702 your activity function started running.
It continued running until around 2019-02-25 14:17:42.4035646 or later (at most 30 seconds later). I can tell because every 30 seconds since it started, the durable task framework was trying to renew its lock on the internal storage queue message.
At 2019-02-25 14:18:11.9345019 the Functions host recycled your process because your activity function exceeded the 5-minute timeout.
At 2019-02-25 14:22:58.3026359 the durable task framework picked up the message again and started running your activity function again. This time it completed very quickly. The time between the process recycle and the next attempt at completing this activity function can be explained by the 5-minute visibility timeout of the internal queue message.
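
The intervals in this timeline can be checked directly from the timestamps. The helper below is a hypothetical sketch (`TimelineCheck.Between` is an illustrative name, not part of any SDK) that parses two log timestamps in the format shown above and subtracts them.

```csharp
using System;
using System.Globalization;

public static class TimelineCheck
{
    // Parses a timestamp such as "2019-02-25 14:13:11.9229702".
    private static DateTime Parse(string timestamp) =>
        DateTime.ParseExact(timestamp, "yyyy-MM-dd HH:mm:ss.fffffff",
            CultureInfo.InvariantCulture);

    // Elapsed time from one log timestamp to a later one.
    public static TimeSpan Between(string from, string to) =>
        Parse(to) - Parse(from);
}
```

`Between("2019-02-25 14:13:11.9229702", "2019-02-25 14:18:11.9345019")` gives about 5 minutes and 0.01 seconds, which matches the 5-minute function timeout. `Between("2019-02-25 14:17:42.4035646", "2019-02-25 14:22:58.3026359")` gives about 5 minutes and 16 seconds. That would be consistent with a 5-minute visibility timeout plus a short pickup delay, assuming the last lock renewal happened near 14:17:42.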

My takeaway here is that your activity function somehow got into a hung state and therefore exceeded the 5-minute timeout. I can’t explain why it hung because that seems to be somewhere in your application logic. I think the next step for you would be to figure out why your code is occasionally hanging. It could be a Redis issue, or (more likely IMO) it could be a deadlock somewhere in your code or in the SDK you’re using. In either case, you may need to get a Redis specialist involved.
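
One way to make such a hang visible earlier is to bound the slow call inside the activity function with its own timeout, so that it fails with an exception well before the host's 5-minute limit. The sketch below is an assumption-laden illustration: `ActivityGuard` and `WithTimeoutAsync` are hypothetical names, and the timeout value is up to the caller. It does not fix the underlying hang, and it cannot interrupt code that blocks a thread synchronously; it only helps when the work is awaited asynchronously.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ActivityGuard
{
    // Runs `work` and throws TimeoutException if it has not finished within `timeout`.
    public static async Task<T> WithTimeoutAsync<T>(
        Func<CancellationToken, Task<T>> work, TimeSpan timeout)
    {
        using (var cts = new CancellationTokenSource())
        {
            Task<T> workTask = work(cts.Token);
            Task timerTask = Task.Delay(timeout, cts.Token);

            Task first = await Task.WhenAny(workTask, timerTask);
            if (first != workTask)
            {
                // Ask the work to stop (it must honor the token) and surface the hang.
                cts.Cancel();
                throw new TimeoutException(
                    $"Operation did not complete within {timeout}.");
            }

            // Work finished first: stop the timer and propagate result or exception.
            cts.Cancel();
            return await workTask;
        }
    }
}
```

An activity could wrap its Redis or stored-procedure call as `await ActivityGuard.WithTimeoutAsync(ct => DoWorkAsync(ct), TimeSpan.FromMinutes(1))`, where `DoWorkAsync` stands in for the application's own method. The resulting exception and its log entry then point at the call that stalled.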

Based on this analysis, I don’t think there’s actually an issue with the Durable Functions extension, so I’ll go ahead and close this issue. Do let me know if you find something that makes you think otherwise and we can re-investigate.