
Durable Entity State Seems to Be Getting Lost, Possibly Because Competing Hosts Save State During Shutdown / Drain Mode Enabled

See original GitHub issue

Description

We have a durable entity that throttles concurrent delivery of incoming messages from an event hub to a target. It does this with an Event Hub-triggered function that signals the entity with a Queue operation, plus an eternal orchestration that, once every couple of minutes, Call()s the ProcessQueuedCommands method on the entity, which drains the queue.

Once in a while… messages seem to disappear from the queue. E.g. we have 980 log messages for receiving the event message from the hub and 980 messages about “EntitySignal:Queue”, but when our process method kicks off it only sees 808 commands in the queue, and we only see 808 processing messages.

~Nothing is standing out in the logs~, and we’ve been unable to recreate this problem anywhere but prod 😦

Update: found some unexpected behaviour in the logs, see the first comment.

Looking for any advice, things to check, or ways to verify that something is going wrong with the entity state (probably because we are doing something wrong). One read-only way to inspect the persisted state is sketched just below.
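For reference, one way to check the persisted entity state out-of-band (without signalling it) is IDurableEntityClient.ReadEntityStateAsync. This is a minimal sketch, not part of the original issue: the function name "InspectThrottleSender" is made up, the state is read as a raw JObject to avoid needing a read model, and the "queuedCommands" / "failedCommands" property names match the [JsonProperty] fields in the entity shown below.

using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Newtonsoft.Json.Linq;
using System.Threading.Tasks;

public static class InspectThrottleSender
{
    // Hypothetical helper (not from the issue): reads the persisted entity state and reports
    // the queue depth, so it can be compared against the EntitySignal:Queue counts in telemetry.
    [FunctionName(nameof(InspectThrottleSender))]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req,
        [DurableClient] IDurableEntityClient client
    )
    {
        var response = await client.ReadEntityStateAsync<JObject>(ThrottleSender.SingletonEntityId);

        if (!response.EntityExists)
        {
            return new NotFoundResult();
        }

        var queued = response.EntityState["queuedCommands"] as JArray;
        var failed = response.EntityState["failedCommands"] as JArray;

        return new OkObjectResult(new
        {
            queueDepth = queued?.Count ?? 0,
            failedCount = failed?.Count ?? 0,
        });
    }
}

Note that this reads the most recently persisted state, which can lag behind the in-memory state while an operation is still running.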

public class ThrottleSender
{
    // it's a singleton, so we will always use a single Id
    public const string InstanceId = "7399a8cd-878d-48a2-a37c-684210bcb902";
    public static readonly EntityId SingletonEntityId = new EntityId(
        nameof(ThrottleSender),
        InstanceId
    );


    [FunctionName(nameof(ThrottleSender))]
    public static Task Run(
        [EntityTrigger] IDurableEntityContext ctx
    ) => ctx.DispatchAsync<ThrottleSender>();


    [JsonProperty]
    private Queue<EnqueuedCommand> queuedCommands = new Queue<EnqueuedCommand>();

    [JsonProperty]
    private List<FailedCommand> failedCommands = new List<FailedCommand>();

    [JsonProperty]
    private List<SuccessfulCommand> recentlySuccessfulCommands = new List<SuccessfulCommand>();

    // requestCount and maxSequentialFailures are used below but were not declared in the
    // original snippet; the declarations and the limit's value are assumed here so the
    // sample compiles. (requestCount is incremented by code not shown in this snippet.)
    [JsonIgnore]
    private int requestCount;

    [JsonIgnore]
    private readonly int maxSequentialFailures = 5;


    [JsonIgnore]
    private readonly ServiceClient serviceClient;

    [JsonIgnore]
    private readonly ILogger<ThrottleSender> log;

    [JsonIgnore]
    private readonly TelemetryClientWrapper telemetryClient;

    public ThrottleSender(
        ServiceClient serviceClient,
        ILogger<ThrottleSender> log,
        TelemetryClientWrapper telemetryClient
    )
    {
        this.serviceClient = serviceClient;
        this.log = log;
        this.telemetryClient = telemetryClient;
    }


    public Task<ThrottleSender> Queue(
        DoStuffCommand command
    )
    {
        queuedCommands.Enqueue(
            new EnqueuedCommand(
                Guid.NewGuid(),
                DateTimeOffset.UtcNow,
                command
            )   
        );
        return Task.FromResult(this);
    }

    public async Task<ThrottleSender> ProcessQueuedCommands(
    )
    {

        // reset as we want to track this per run
        requestCount = 0;

        var startTime = DateTimeOffset.UtcNow;
        // stay under the Consumption plan's 10-minute limit
        var endTime = startTime + TimeSpan.FromMinutes(5);

        var enqueuedCommandsBeforeMetric = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.QUEUED_COMMANDS,
            Sum = queuedCommands.Count(),
            Timestamp = startTime,
        };

        var failedCommandsBeforeMetric = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.FAILED_COMMMANDS_RETAINED,
            Sum = failedCommands.Count(),
            Timestamp = startTime,
        };

        var successCommandsBeforeMetric = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.SUCCESSFUL_COMMANDS_RETAINED,
            Sum = recentlySuccessfulCommands.Count(),
            Timestamp = startTime,
        };

        telemetryClient.TrackMetric(enqueuedCommandsBeforeMetric);
        telemetryClient.TrackMetric(failedCommandsBeforeMetric);
        telemetryClient.TrackMetric(successCommandsBeforeMetric);

        //TODO: is this needed?
        telemetryClient.Flush();

        var sequentialFailures = 0;

        var inFlightTasks = new List<Task<DoStuffCommandResult>>();
        var inFlightCommands = new List<EnqueuedCommand>();
        var maxConcurrency = 5;

        // run loop while we have time AND (commandsToAdd OR processesToFinish)
        while (DateTimeOffset.UtcNow < endTime && (queuedCommands.Count > 0 || inFlightCommands.Count > 0)) 
        {
            // fill up to maxConcurrency with in-flight tasks
            while (DateTimeOffset.UtcNow < endTime && queuedCommands.Count > 0 && inFlightTasks.Count < maxConcurrency)
            {
                var c = queuedCommands.Dequeue();

                inFlightTasks.Add(serviceClient.Execute(c));
                inFlightCommands.Add(c);
            }

            var finishedTask = await Task.WhenAny(inFlightTasks);

            inFlightTasks.Remove(finishedTask);

            var result = await finishedTask;

            // remove the command whether or not it succeeded
            inFlightCommands.Remove(result.Command);

            if(result.Exception != null)
            {
                failedCommands.Add(
                    new FailedCommand(
                        result.Command.Id,
                        result.Command.Enqueued,
                        result.Command.Command,
                        DateTimeOffset.UtcNow,
                        result.Exception
                    )
                );

                log.LogError(
                    Troubleshooting.EventIDs.REQUEST_FAILURE,
                    result.Exception,
                    "executing command failed"
                );

                sequentialFailures++;

                if (sequentialFailures > maxSequentialFailures)
                {
                    // TODO: do we need to reset this? don't want the entity to sit around in dirty state
                    sequentialFailures = 0;

                    throw new Exception(
                        "ProcessQueuedCommands sequential failure limit reached"
                    );
                }

                continue;

            }

            // if we got here, task succeeded
            sequentialFailures = 0;

            log.LogInformation("processed command");

            recentlySuccessfulCommands.Add(
                new SuccessfulCommand(
                    result.Command.Id,
                    result.Command.Enqueued,
                    result.Command.Command,
                    DateTimeOffset.UtcNow
                )
            );

        }

        // HACK: put the in-flight commands back at the front of the queue so they can be redelivered.
        // Enumerable.Reverse is called explicitly because List<T>.Reverse() is the in-place void overload.
        foreach (var c in Enumerable.Reverse(inFlightCommands))
        {
            queuedCommands = new Queue<EnqueuedCommand>(queuedCommands.Prepend(c));
        }

        // Hold onto recent successes for 3 days; drop anything older.
        recentlySuccessfulCommands.RemoveAll(c => c.Succeeded < startTime.AddDays(-3));

        // record the actual finish time (we don't always run right up to the time budget)
        endTime = DateTimeOffset.UtcNow;

        var enqueuedCommandsAfterMetric = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.QUEUED_COMMANDS,
            Sum = queuedCommands.Count(),
            Timestamp = endTime,
        };

        var failedCommandsAfterMetric = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.FAILED_COMMMANDS_RETAINED,
            Sum = failedCommands.Count(),
            Timestamp = endTime,
        };

        var successCommandsAfterMetric = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.SUCCESSFUL_COMMANDS_RETAINED,
            Sum = recentlySuccessfulCommands.Count(),
            Timestamp = endTime,
        };

        var requestCountMetrics = new MetricTelemetry()
        {
            Name = Troubleshooting.TelemetryMetricNames.REQUEST_SENT,
            Sum = requestCount,
            Timestamp = endTime,
        };

        telemetryClient.TrackMetric(enqueuedCommandsAfterMetric);
        telemetryClient.TrackMetric(failedCommandsAfterMetric);
        telemetryClient.TrackMetric(successCommandsAfterMetric);
        telemetryClient.TrackMetric(requestCountMetrics);

        return this;
    }

}

public record EnqueuedCommand(
    Guid Id,
    DateTimeOffset Enqueued,
    DoStuffCommand Command 
);
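The other record types referenced above (FailedCommand, SuccessfulCommand, DoStuffCommandResult) are not shown in the issue. The shapes below are hypothetical, inferred only from how ProcessQueuedCommands uses them, and are included just so the snippet reads as a complete unit:

// Hypothetical definitions (not from the issue), inferred from usage in ProcessQueuedCommands.
public record FailedCommand(
    Guid Id,
    DateTimeOffset Enqueued,
    DoStuffCommand Command,
    DateTimeOffset Failed,
    Exception Exception
);

public record SuccessfulCommand(
    Guid Id,
    DateTimeOffset Enqueued,
    DoStuffCommand Command,
    DateTimeOffset Succeeded
);

public record DoStuffCommandResult(
    EnqueuedCommand Command,
    Exception Exception
);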

the thing filling the queue is DEAD simple

[FunctionName("QueueStuff")]
    public async Task QueueStuff(
        [EventHubTrigger("stuff", Connection = "con", ConsumerGroup = "$notdefault")] EventData message,
        [DurableClient] IDurableEntityClient durableEntityClient
    )
    {
        //bunch of value checks omitted

              await durableEntityClient.SignalEntityAsync(
                  ThottleSender.SingletonEntityId,
                  nameof(ThottleSender.Queue),
                  message.AsCommand()
              );
    }

and the eternal orchestration is

    [FunctionName(nameof(ProcessQueuedCommands))]
    public async Task ProcessQueuedCommands(
        [OrchestrationTrigger] IDurableOrchestrationContext context
    )
    {

        var replaySafeLogger = context.CreateReplaySafeLogger(log); // 'log' is an injected ILogger not shown in this snippet
        var loop = 0;
        
        while (loop++ < 100)
        {

            await context.CallEntityAsync(
                ThrottleSender.SingletonEntityId,
                nameof(ThrottleSender.ProcessQueuedCommands)
            );

            var completeTime = context.CurrentUtcDateTime;


            await context.CreateTimer(completeTime.AddSeconds(10), CancellationToken.None);
            
        }

        // after 100 loops, restart via ContinueAsNew to keep history bounded while continuing to execute;
        // see the eternal-orchestration best practice at https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-eternal-orchestrations?tabs=csharp
        context.ContinueAsNew(null);
    }
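How this eternal orchestration gets started is not shown in the issue. For context, a minimal sketch of the usual singleton-starter pattern follows; the function name "StartProcessingLoop", the timer schedule, and the instance id are assumptions for illustration only.

using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using System.Threading.Tasks;

public class ProcessingLoopStarter
{
    // Hypothetical starter (not from the issue): ensures only one instance of the eternal
    // orchestration exists by always using the same instance id and only starting it when
    // no instance is currently running.
    [FunctionName(nameof(StartProcessingLoop))]
    public async Task StartProcessingLoop(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client
    )
    {
        const string instanceId = "process-queued-commands-singleton";

        var status = await client.GetStatusAsync(instanceId);

        if (status == null
            || status.RuntimeStatus == OrchestrationRuntimeStatus.Completed
            || status.RuntimeStatus == OrchestrationRuntimeStatus.Failed
            || status.RuntimeStatus == OrchestrationRuntimeStatus.Terminated)
        {
            await client.StartNewAsync("ProcessQueuedCommands", instanceId);
        }
    }
}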

Expected behavior

Expected that if we see 980 EntitySignal:Queue messages, we'd see 980 commands in the queue when ProcessQueuedCommands is called.

Actual behavior

Only saw 808 commands in the queue.

Relevant source code snippets

see above

Known workarounds

none

App Details

  • Durable Functions extension version (e.g. v1.8.3): 2.7.2
  • Azure Functions runtime version (1.0 or 2.0): v4, .NET 6, in-process
  • Programming language used: C#

Screenshots

If applicable, add screenshots to help explain your problem.

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

If you don't want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it's extremely difficult to look up information.

Issue Analytics

  • State: closed
  • Created: 6 months ago
  • Reactions: 2
  • Comments: 19

Top GitHub Comments

2 reactions
sebastianburckhardt commented, Apr 20, 2023

I can confirm that scenario 2 is taking place in this window. I can also see that the inputs are large, which means the contents are stored in blobs. My current suspicion is that the e-tag protection may not have taken into account this “large properties spill into blobs” mechanism which could explain why the loser’s state ends up winning (because it overwrites the blob) even though its table storage access loses (because the e-tag does not match). I will keep investigating.
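To make the suspected race concrete: the sketch below is an illustration only (it is not the Durable Functions storage provider's code, and the class, method, and parameters are invented for the example). It shows why an e-tag condition on a table row does not protect a payload that was spilled into a separate blob: the blob overwrite is unconditional, so a host whose table write loses the e-tag check can still have already replaced the blob contents.

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;
using Azure.Storage.Blobs;

public static class EtagSpillSketch
{
    // Illustration of the mechanism described above, NOT the actual provider implementation.
    public static async Task SaveStateAsync(
        TableClient table, TableEntity stateRow, ETag etagFromEarlierRead,
        BlobClient overflowBlob, BinaryData largeStatePayload)
    {
        // Large state spilled to a blob: this overwrite is unconditional, so a "losing" host
        // still replaces the blob contents.
        await overflowBlob.UploadAsync(largeStatePayload, overwrite: true);

        // The table row update is conditional on the e-tag: this throws 412 (Precondition Failed)
        // if another host updated the row first - but by then the blob has already been overwritten.
        await table.UpdateEntityAsync(stateRow, etagFromEarlierRead, TableUpdateMode.Replace);
    }
}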

2 reactions
drdamour commented, Apr 10, 2023

ok, spent a few more hours trawling the logs… there was one minute, 8:49 AM MT, where exactly 172 Queue methods were invoked, which seemed like more than a coincidence. looking around those logs… this entity was processing things simultaneously on two hosts… as i understood it, that is NOT supposed to happen.

during the execution of ProcessQueuedCommands these logs came through: [screenshot]

so that host was being shut down

shortly after that the other host started the entity

[screenshot]

but during that, the first host (the one shutting down) was happy to keep running its activity…

[screenshot]

until it finally exited, and the host quickly shut down

[screenshot]

and presumably that shutting-down host then serialized its state (which was now an empty queue) to storage… and all the state from the other host was wiped out.

so…
(1) is this a bug in the runtime?
(2) should two hosts ever be working with the same entity instance id simultaneously?
(3) are we doing something wrong in our code? like, are we supposed to take a cancellation token and have our activity exit early because of the drain event? i don't recall ever reading any docs around that… but it kinda makes sense… (see the sketch below)
(4) any other advice?
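For context on question (3): in-process Azure Functions can bind a CancellationToken parameter that the host signals when it is shutting down or draining. Whether and how that token should be flowed into a durable entity operation is exactly what the question is asking and is not answered here; the sketch below only shows the general shape of the pattern for an ordinary (non-entity) function, with the class name and timer schedule invented for illustration.

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class LongRunningWorker
{
    // Hypothetical non-entity function showing the general cancellation pattern:
    // the host signals the bound CancellationToken during shutdown/drain, and the
    // loop checks it between units of work so it can exit early.
    [FunctionName(nameof(Run))]
    public async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        CancellationToken hostShutdownToken,
        ILogger log
    )
    {
        while (!hostShutdownToken.IsCancellationRequested)
        {
            // stand-in for one unit of work; real code would re-check the token between units
            await Task.Delay(TimeSpan.FromSeconds(1));
        }

        log.LogInformation("Host is shutting down or draining; exiting early.");
    }
}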

Read more comments on GitHub >

Top Results From Across the Web

  • Jobhost shutdown can cause in-flight executions to fail #1112
    Durable Entity State Seems to be Getting Lost Possibly Cause Competing Hosts Saving State During Shutdown / DrainMode enabled #2437.
  • Azure Durable Entity cannot be reached anymore, seems ...
    I am using Azure Durable Functions and have this running in production. All goes well, but for one specific entity, it seems I...
  • Why do I See "DrainMode mode enabled" in the logs and ...
    The drain mode doesnt seem to have caused the functions to stop. I would suggest you to look into the Diagnose and solve...
  • Introducing Durable Entities for Serverless State
    Learn how durable entities provide explicit management of state in serverless applications and guarantee operations are safe to execute...
  • Durable Entities - Azure Durable Functions Succinctly Ebook
    The function uses IDurableEntityContext to manage the input data from the caller, the data to return (if you need it), and the entity's...
