ChangeFeedProcessor behavior when failing first batch
Describe the bug
We have a change feed processor with an onChangesDelegate that we expect to fail every so often with a transient error. In our tests, we simulate an error in the delegate on its first run, and unexpectedly, no retries are attempted.
We use the change feed to perform actions based on any new or updated items in the db, and we need at-least-once processing of items, so missing a batch breaks our system.
To Reproduce
bool shouldThrow = true;
async Task HandleChangesAsync(IReadOnlyCollection<JObject> changes, CancellationToken cancellationToken)
{
    if (shouldThrow)
    {
        shouldThrow = false;
        throw new Exception("Transient error");
    }
    // Process changes - Never reaches here
}
Starting the change feed:
Container leaseContainer = client.GetContainer(options.DatabaseId, options.LeasesContainerId);
changeFeedProcessor_ = client.GetContainer(options.DatabaseId, options.ContainerId)
    .GetChangeFeedProcessorBuilder<JObject>(processorName: processorName_, HandleChangesAsync)
    .WithInstanceName(instanceName_)
    .WithLeaseContainer(leaseContainer)
    .Build();
await changeFeedProcessor_.StartAsync();
Expected behavior
After the first failure, I would expect the batch to be retried.
Actual behavior
The batch is not retried.
Environment summary
SDK Version: 3.17.1
OS Version: Windows
Additional context
I found this comment on another issue that indicates that it may be intended behavior:
The only scenario where the batch might not be retried is if the batch that throws is the first ever (lease has no Continuation). Because when the host picks up the lease again to reprocess, it has no point in time to retry from.
Originally posted by @ealsur in https://github.com/Azure/azure-cosmos-dotnet-v3/issues/405#issuecomment-500931938
If this is the intended behavior, I think this needs to be called out more prominently in the documentation, unless I have missed it.
It seems that adding .WithStartTime(DateTime.MinValue.ToUniversalTime()) to the change feed processor initialization causes the failed batch to be retried.
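A sketch of that workaround, reusing the names from the snippet above (client, options, processorName_, instanceName_ and HandleChangesAsync are assumed to be defined as shown there). Only the WithStartTime call is new. Note that this starts reading from the beginning of the container's change feed, so the delegate will see existing items as well as new ones.

```csharp
// Fragment: same builder chain as above, with an explicit start time added.
Container leaseContainer = client.GetContainer(options.DatabaseId, options.LeasesContainerId);
changeFeedProcessor_ = client.GetContainer(options.DatabaseId, options.ContainerId)
    .GetChangeFeedProcessorBuilder<JObject>(processorName: processorName_, HandleChangesAsync)
    .WithInstanceName(instanceName_)
    .WithLeaseContainer(leaseContainer)
    // With an explicit start time, a lease that has no Continuation yet
    // is read from this point again after a failed first batch.
    .WithStartTime(DateTime.MinValue.ToUniversalTime())
    .Build();
await changeFeedProcessor_.StartAsync();
```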
Top GitHub Comments
The docs have been updated with a note for this scenario, thanks for calling it out.
The first batch cannot be retried because the state (the leases) does not have any previous value to use.
CFP can be started either from “Now on” (default) or from some specific time (the beginning or another point in time).
When the processor is initialized, it creates the leases; these leases have no stored Continuation.
It then fetches the first batch of changes either from Now or from the time specified and sends them to processing.
If that batch fails, the lease cannot be updated (because it was a failure), so the lease state is whatever it was before the batch was fetched.
When the lease is retried, it has no Continuation (it never saved any), so CFP uses the configured time (Now, Beginning, some other Time) to read as if it was again the first time ever. That is why if you set some time in particular (beginning, or other), then you could see the changes again.
We can certainly add those details to the documentation; this has been the behavior since CFP V2. We don’t have any Continuation to store that could make the first batch retriable.
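The sequence described in this comment can be illustrated with a toy model. This is only a sketch of the reasoning, not the SDK's implementation: the Lease class, FirstBatchDemo and Run are hypothetical names, the feed is a plain list whose index stands in for the feed position, and "now" is modeled as the current end of that list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a lease document; not the SDK's type.
class Lease
{
    public int? Continuation; // null until a batch has been processed successfully
}

static class FirstBatchDemo
{
    // Returns the items the handler eventually processes successfully.
    // startFromBeginning == false models the default "from now" start.
    public static List<int> Run(bool startFromBeginning)
    {
        var feed = new List<int>();   // list index doubles as the feed position
        var lease = new Lease();
        var processed = new List<int>();
        bool shouldThrow = true;      // fail the first batch only, as in the issue

        // Where to read from: the saved Continuation if any, else the configured start.
        int Resolve() => lease.Continuation ?? (startFromBeginning ? 0 : feed.Count);

        int position = Resolve();     // processor starts; the lease has no Continuation
        feed.Add(100);                // two changes arrive after the start
        feed.Add(101);

        for (int attempt = 0; attempt < 2; attempt++)
        {
            List<int> batch = feed.GetRange(position, feed.Count - position);
            try
            {
                if (shouldThrow)
                {
                    shouldThrow = false;
                    throw new Exception("Transient error");
                }
                processed.AddRange(batch);
                lease.Continuation = position + batch.Count; // checkpoint only on success
            }
            catch (Exception)
            {
                // Failure: the lease is left untouched, so the start position
                // is resolved again exactly as on the very first read.
                position = Resolve();
                feed.Add(102);        // another change arrives before the retry
            }
        }
        return processed;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", Run(startFromBeginning: false))); // prints 102
        Console.WriteLine(string.Join(",", Run(startFromBeginning: true)));  // prints 100,101,102
    }
}
```

In the "from now" run, the failed batch (100, 101) is never seen again because "now" has moved past it by the time the lease is retried. In the run that starts from the beginning, the same batch is read again together with the newer change.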