ChangeFeedProcessor behavior when failing first batch


Describe the bug

We have a change feed processor with an onChangesDelegate that we expect to fail every so often with a transient error. In our tests, we simulate an error in the delegate on its first run, and unexpectedly, no further retries are attempted.

We use the change feed to perform actions based on any new or updated items in the db, and we need at-least-once processing of items, so missing a batch breaks our system.

To Reproduce

        bool shouldThrow = true;

        async Task HandleChangesAsync(IReadOnlyCollection<JObject> changes, CancellationToken cancellationToken)
        {
            if (shouldThrow)
            {
                shouldThrow = false;
                throw new Exception("Transient error");
            }

            // Process changes - Never reaches here
        }

Starting the change feed:

        Container leaseContainer = client.GetContainer(options.DatabaseId, options.LeasesContainerId);
        changeFeedProcessor_ = client.GetContainer(options.DatabaseId, options.ContainerId)
            .GetChangeFeedProcessorBuilder<JObject>(processorName: processorName_, HandleChangesAsync)
            .WithInstanceName(instanceName_)
            .WithLeaseContainer(leaseContainer)
            .Build();

        await changeFeedProcessor_.StartAsync();

Expected behavior

I would expect that after the first failure, the batch would be retried.

Actual behavior

The batch is not retried.

Environment summary

SDK Version: 3.17.1
OS Version: Windows

Additional context

I found this comment on another issue that indicates that it may be intended behavior:

The only scenario where the batch might not be retried is if the batch that throws is the first ever (lease has no Continuation). Because when the host picks up the lease again to reprocess, it has no point in time to retry from.

_Originally posted by @ealsur in https://github.com/Azure/azure-cosmos-dotnet-v3/issues/405#issuecomment-500931938_

If this is the intended behavior, I think this needs to be called out more prominently in the documentation, unless I have missed it.

It seems like adding .WithStartTime(DateTime.MinValue.ToUniversalTime()) to the change feed processor initialization causes the failed batch to be retried.
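
A minimal sketch of that workaround, assuming the same builder calls as the reproduction above. The class name ChangeFeedStartup, the method name, and its parameters are hypothetical placeholders; only the builder chain and the WithStartTime call come from the issue.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Newtonsoft.Json.Linq;

public static class ChangeFeedStartup
{
    // Hypothetical helper: every parameter name here is a placeholder.
    public static async Task<ChangeFeedProcessor> StartFromBeginningAsync(
        CosmosClient client,
        string databaseId,
        string containerId,
        string leasesContainerId,
        string processorName,
        string instanceName,
        Container.ChangesHandler<JObject> handleChangesAsync)
    {
        Container leaseContainer = client.GetContainer(databaseId, leasesContainerId);

        ChangeFeedProcessor processor = client.GetContainer(databaseId, containerId)
            .GetChangeFeedProcessorBuilder<JObject>(processorName, handleChangesAsync)
            .WithInstanceName(instanceName)
            .WithLeaseContainer(leaseContainer)
            // Explicit start time, as suggested in the issue, so a lease with
            // no saved Continuation is read again from this point.
            .WithStartTime(DateTime.MinValue.ToUniversalTime())
            .Build();

        await processor.StartAsync();
        return processor;
    }
}
```

One likely trade-off: with a start time this early, the first run should also deliver changes that predate the processor, so the delegate probably needs to tolerate items it would not have seen when starting from "now".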

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
ealsur commented, Jun 9, 2021

The docs have been updated with a note for this scenario, thanks for calling it out.

2 reactions
ealsur commented, Jun 9, 2021

The first batch cannot be retried because the state (the leases) has no previous value to use.

The change feed processor (CFP) can be started either from "now" (the default) or from a specific time (the beginning, or any other point).

When the processor is initialized, it creates the leases; these leases have no stored Continuation.

It then fetches the first batch of changes, either from now or from the time specified, and sends them to processing.

If that batch fails, the lease cannot be updated (because it was a failure), so the lease state is whatever it was before the batch was fetched.

When the lease is retried, it has no Continuation (it never saved one), so CFP uses the configured time (now, the beginning, or some other time) and reads as if it were the first time ever. That is why, if you set a particular time (the beginning or otherwise), you can see the changes again.

We can certainly add those details to the documentation; this has been the behavior since CFP V2. We don't have any Continuation to store that would make the first batch retriable.
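
The sequence above can be sketched with a toy model. This is not the SDK's implementation: ToyLease, ToyChangeFeed, and the integer positions are hypothetical stand-ins for leases, the feed, and continuation tokens, chosen only to show why a failed first batch is lost when starting from "now" but replayed when starting from the beginning.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: none of these types come from the Cosmos DB SDK.
public sealed class ToyLease
{
    // Position of the next change to read; null until a batch succeeds.
    public int? Continuation { get; set; }
}

public static class ToyChangeFeed
{
    // Where a host starts reading when it picks up a lease: the saved
    // Continuation if there is one, otherwise the configured start
    // (0 for "beginning", the current end of the feed for "now").
    public static int ResolveStart(ToyLease lease, int feedLength, bool fromBeginning) =>
        lease.Continuation ?? (fromBeginning ? 0 : feedLength);

    // Hands everything from 'position' onward to the handler.
    // The lease is checkpointed only when the handler succeeds.
    public static List<string> Deliver(
        List<string> feed, int position, ToyLease lease, bool handlerFails)
    {
        List<string> batch = feed.Skip(position).ToList();
        if (handlerFails)
        {
            return new List<string>(); // failure: lease left untouched
        }
        lease.Continuation = feed.Count;
        return batch;
    }

    // The issue's scenario: the first delegate call fails, the second succeeds.
    // Returns the items the successful call received.
    public static List<string> Run(bool fromBeginning)
    {
        var feed = new List<string>();
        var lease = new ToyLease();

        int position = ResolveStart(lease, feed.Count, fromBeginning); // lease acquired
        feed.Add("item-1");                                            // a change arrives
        Deliver(feed, position, lease, handlerFails: true);            // first batch fails

        position = ResolveStart(lease, feed.Count, fromBeginning);     // lease picked up again
        feed.Add("item-2");
        return Deliver(feed, position, lease, handlerFails: false);
    }
}

public static class Program
{
    public static void Main()
    {
        // Start from "now": item-1 is never seen again.
        Console.WriteLine(string.Join(",", ToyChangeFeed.Run(fromBeginning: false))); // item-2
        // Start from the beginning: the failed batch is read again.
        Console.WriteLine(string.Join(",", ToyChangeFeed.Run(fromBeginning: true)));  // item-1,item-2
    }
}
```

In this model the only difference between the two runs is what ResolveStart falls back to when the lease has no Continuation, which matches the explanation above.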


Top Results From Across the Web

Change feed processor in Azure Cosmos DB
In only one scenario, a batch of changes is not retried. If the failure happens on the first ever delegate execution, the lease...

ChangeFeedProcessorBuilder checkpointing after ...
Adding from comments: The only scenario where the batch might not be retried is if the batch that throws is the first ever...

Exception in ChangeFeedProcessor is swallowed #405
An exception inside the delegate does not stop the Processor, it simply fails the current batch and retries it.

On .NET Episode: Streaming and Batching with the Cosmos ...
NET. This time, his covers some interesting features like the estimator API, the change feed processor, running bulk operations and also ...

Azure Cosmos DB : Change Feed processor for Java
I came across Azure's Change feed processor pattern which has been ... node fails, then the entire batch fails if the node processing...
