ChangeFeedProcessor behavior when failing first batch

Describe the bug

We have a change feed processor with an onChangesDelegate that we expect to fail every so often with a transient error. In our tests, we simulate an error in the delegate on its first run and, unexpectedly, the failed batch is never retried.

We use the change feed to perform actions based on any new or updated items in the database, and we need at-least-once processing of items, so a missed batch breaks our system.

To Reproduce

        bool shouldThrow = true;

        async Task HandleChangesAsync(IReadOnlyCollection<JObject> changes, CancellationToken cancellationToken)
        {
            if (shouldThrow)
            {
                shouldThrow = false;
                throw new Exception("Transient error");
            }

            // Process changes - Never reaches here
        }

Starting the change feed:

        Container leaseContainer = client.GetContainer(options.DatabaseId, options.LeasesContainerId);
        changeFeedProcessor_ = client.GetContainer(options.DatabaseId, options.ContainerId)
            .GetChangeFeedProcessorBuilder<JObject>(processorName: processorName_, HandleChangesAsync)
            .WithInstanceName(instanceName_)
            .WithLeaseContainer(leaseContainer)
            .Build();

        await changeFeedProcessor_.StartAsync();

Expected behavior

I would expect the batch to be retried after the first failure.

Actual behavior

The batch is not retried.

Environment summary

SDK Version: 3.17.1
OS Version: Windows

Additional context

I found this comment on another issue that indicates that it may be intended behavior:

The only scenario where the batch might not be retried is if the batch that throws is the first ever (lease has no Continuation). Because when the host picks up the lease again to reprocess, it has no point in time to retry from.

Originally posted by @ealsur in https://github.com/Azure/azure-cosmos-dotnet-v3/issues/405#issuecomment-500931938

If this is the intended behavior, I think this needs to be called out more prominently in the documentation, unless I have missed it.

It seems like adding .WithStartTime(DateTime.MinValue.ToUniversalTime()) to the change feed processor initialization causes the failed batch to be retried.
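For reference, here is a minimal sketch of that workaround applied to the startup code above. The helper name `StartFromBeginningAsync` and its parameters are hypothetical stand-ins for the `options`, `processorName_` and `instanceName_` values in the reproduction; the only functional change is the `WithStartTime` call.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

public static class ChangeFeedStartup
{
    // Hypothetical helper: the method and parameter names are illustrative
    // and do not come from the issue or the SDK.
    public static async Task<ChangeFeedProcessor> StartFromBeginningAsync(
        CosmosClient client,
        string databaseId,
        string containerId,
        string leasesContainerId,
        string processorName,
        string instanceName,
        Container.ChangesHandler<JObject> onChanges)
    {
        Container leaseContainer = client.GetContainer(databaseId, leasesContainerId);

        ChangeFeedProcessor processor = client.GetContainer(databaseId, containerId)
            .GetChangeFeedProcessorBuilder<JObject>(processorName, onChanges)
            .WithInstanceName(instanceName)
            .WithLeaseContainer(leaseContainer)
            // Start time used while a lease has no saved Continuation:
            // read from the beginning instead of the default "now".
            .WithStartTime(DateTime.MinValue.ToUniversalTime())
            .Build();

        await processor.StartAsync();
        return processor;
    }
}
```

Note that with this start time a lease without a Continuation reads from the beginning, so items that already existed before the processor started are delivered too. The handler should therefore tolerate seeing the same item more than once.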

Top GitHub Comments

2 reactions

ealsur commented, Jun 9, 2021

The docs have been updated with a note for this scenario, thanks for calling it out.

2 reactions

ealsur commented, Jun 9, 2021

The first batch cannot be retried because the state (the leases) does not have any previous value to use.

The change feed processor (CFP) can be started either from “now on” (the default) or from a specific time (the beginning or any other time).

When the processor is initialized, it creates the leases. These leases have no stored Continuation.

It then fetches the first batch of changes either from Now or from the time specified and sends them to processing.

If that batch fails, the lease cannot be updated (because it was a failure), so the lease state is whatever it was before the batch was fetched.

When the lease is retried, it has no Continuation (it never saved one), so CFP uses the configured start time (Now, Beginning, or some other time) and reads as if it were the first time ever. That is why, if you set a particular start time (the beginning or some other time), you can see the changes again.

We can certainly add those details to the documentation. This has been the behavior since CFP V2: we don’t have any Continuation to store that could make the first batch retriable.
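The sequence described in this comment can be illustrated with a small toy model. This is only a sketch of the explanation, not the SDK's implementation: `ToyLease`, `ToyChangeFeed`, `ResolveStart` and `Run` are invented names, the feed is a plain in-memory list, and a Continuation is modeled as an index into that list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: none of these types exist in the Cosmos DB SDK.
public sealed class ToyLease
{
    // Stays null until a batch has been processed successfully at least once.
    public int? Continuation { get; set; }
}

public static class ToyChangeFeed
{
    // With a saved Continuation, resume from it. Otherwise fall back to the
    // configured start: the current end of the feed ("now") or index 0 ("beginning").
    public static int ResolveStart(ToyLease lease, bool startFromBeginning, int feedLength)
        => lease.Continuation ?? (startFromBeginning ? 0 : feedLength);

    // Simulates a handler that throws on its first invocation only,
    // and returns the items that were eventually processed.
    public static List<string> Run(bool startFromBeginning)
    {
        var feed = new List<string>();
        var lease = new ToyLease();
        var processed = new List<string>();
        bool shouldThrow = true;

        // The processor starts; the read position lives in memory only.
        int position = ResolveStart(lease, startFromBeginning, feed.Count);

        // An item is written after the processor has started.
        feed.Add("A");

        for (int attempt = 0; attempt < 2; attempt++)
        {
            List<string> batch = feed.Skip(position).ToList();
            if (batch.Count == 0)
            {
                continue; // nothing new to deliver
            }

            try
            {
                if (shouldThrow)
                {
                    shouldThrow = false;
                    throw new InvalidOperationException("Transient error");
                }

                processed.AddRange(batch);
                // Success: checkpoint, so later failures can resume from here.
                lease.Continuation = feed.Count;
                position = feed.Count;
            }
            catch (InvalidOperationException)
            {
                // Failure: the lease is left untouched, and the position is
                // resolved again exactly as on the very first start.
                position = ResolveStart(lease, startFromBeginning, feed.Count);
            }
        }

        return processed;
    }
}
```

In this model, `ToyChangeFeed.Run(startFromBeginning: false)` returns an empty list, because "now" is re-evaluated after the failure and item "A" is already in the past. `ToyChangeFeed.Run(startFromBeginning: true)` returns a list containing "A", because the fallback start position still covers the failed batch.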