[BUG] Multiple partition supervisor tasks running simultaneously if leaseAcquireInterval is smaller than leaseRenewInterval
Describe the bug
PartitionControllerImpl.processPartition discards the running partition supervisor task from the PartitionControllerImpl.currentlyOwnedPartitions map by calling PartitionControllerImpl.removeLease(lease) right after the task has been started. This makes it possible to have multiple partition supervisor tasks running simultaneously if ChangeFeedProcessorOptions.leaseAcquireInterval() is smaller than ChangeFeedProcessorOptions.leaseRenewInterval().
That happens because the next execution of PartitionLoadBalancerImpl.run() is triggered before the lease renewal, so this particular lease is still available for us to take. Since the lease is no longer saved in the PartitionControllerImpl.currentlyOwnedPartitions map, another partition supervisor task is started for it.
As a result, the ChangeFeedProcessorOptions.feedPollDelay() and ChangeFeedProcessorOptions.leaseRenewInterval() settings are not followed, and we could even end up with parallel consumption of the same change feed batch on a single instance of ChangeFeedProcessor.
Exception or Stack Trace n/a
To Reproduce
- Build a change feed processor via ChangeFeedProcessor.builder, providing ChangeFeedProcessorOptions with leaseAcquireInterval smaller than leaseRenewInterval:
    ChangeFeedProcessorOptions options = new ChangeFeedProcessorOptions();
    options
        .leaseAcquireInterval(Duration.ofSeconds(12))
        .leaseRenewInterval(Duration.ofSeconds(20));
    ChangeFeedProcessor changeFeedProcessor =
        ChangeFeedProcessor.Builder()
            .hostName("my-hostname")
            .feedContainer(feedContainer)
            .leaseContainer(leaseContainer)
            .options(options)
            .handleChanges(docs -> {
                // act on docs
            })
            .build();
- Start the processor:
changeFeedProcessor.start().block();
- After two executions of the sequence inside PartitionLoadBalancerImpl.run() there will be two partition supervisor tasks running simultaneously (which means two parallel executions of PartitionProcessorImpl.run() and LeaseRenewerImpl.run()), so the expected cadence of executions defined in ChangeFeedProcessorOptions.feedPollDelay() and ChangeFeedProcessorOptions.leaseRenewInterval() is violated.
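The timing in the steps above can be sketched with a small stand-alone model. This is only an illustration under stated assumptions, not the SDK's actual logic: the class LeaseRaceSketch, the method supervisorsStarted and the horizonSeconds parameter are hypothetical names, a HashSet stands in for currentlyOwnedPartitions, the load balancer is assumed to tick every leaseAcquireInterval starting at t = 0, and the lease is assumed to look available until the first renewal happens leaseRenewInterval after the first task started.

```java
import java.util.HashSet;
import java.util.Set;

class LeaseRaceSketch {

    /**
     * Counts how many supervisor tasks get started for a single lease.
     * Simplified model (an assumption, not the SDK's logic): the load balancer
     * ticks every acquireSeconds starting at t = 0, and the lease keeps looking
     * available until the first renewal, renewSeconds after the first start.
     */
    static int supervisorsStarted(int acquireSeconds, int renewSeconds,
                                  boolean removeRightAfterStart, int horizonSeconds) {
        if (acquireSeconds <= 0) {
            throw new IllegalArgumentException("acquireSeconds must be positive");
        }
        // Hypothetical stand-in for PartitionControllerImpl.currentlyOwnedPartitions.
        Set<String> currentlyOwned = new HashSet<>();
        String leaseToken = "0";
        int started = 0;

        for (int t = 0; t <= horizonSeconds; t += acquireSeconds) {
            // The first task starts at t = 0, so its first renewal lands at renewSeconds.
            boolean renewedByNow = started > 0 && t >= renewSeconds;
            boolean looksAvailable = !renewedByNow;

            if (looksAvailable && !currentlyOwned.contains(leaseToken)) {
                started++;
                currentlyOwned.add(leaseToken);
                if (removeRightAfterStart) {
                    // Reported behavior: the entry is dropped while the task still runs.
                    currentlyOwned.remove(leaseToken);
                }
            }
        }
        return started;
    }

    public static void main(String[] args) {
        System.out.println("acquire 12s, renew 20s, early removal: "
            + supervisorsStarted(12, 20, true, 24));
        System.out.println("acquire 12s, renew 20s, entry kept:    "
            + supervisorsStarted(12, 20, false, 24));
        System.out.println("acquire 30s, renew 20s, early removal: "
            + supervisorsStarted(30, 20, true, 60));
    }
}
```

In this model the second load balancer tick (t = 12s) happens before the first renewal (t = 20s), so with early removal nothing stops a second task from starting. When the entry stays in the set, or when the acquire interval is longer than the renew interval, only one task is started.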
Code Snippet
PartitionControllerImpl.removeLease(lease) is executed right after we schedule the task in the PartitionControllerImpl.processPartition method, so the lease gets removed from the PartitionControllerImpl.currentlyOwnedPartitions map right away even though the task is still running:
    private WorkerTask processPartition(PartitionSupervisor partitionSupervisor, Lease lease) {
        CancellationToken cancellationToken = this.shutdownCts.getToken();
        WorkerTask partitionSupervisorTask = new WorkerTask(lease, () -> {
            partitionSupervisor.run(cancellationToken)
                .onErrorResume(throwable -> {
                    if (throwable instanceof PartitionSplitException) {
                        PartitionSplitException ex = (PartitionSplitException) throwable;
                        return this.handleSplit(lease, ex.getLastContinuation());
                    } else if (throwable instanceof TaskCancelledException) {
                        logger.debug("Partition {}: processing canceled.", lease.getLeaseToken());
                    } else {
                        logger.warn("Partition {}: processing failed.", lease.getLeaseToken(), throwable);
                    }
                    return Mono.empty();
                })
                .then(this.removeLease(lease)).subscribe();
        });
        this.scheduler.schedule(partitionSupervisorTask);
        return partitionSupervisorTask;
    }

    private Mono<Void> removeLease(Lease lease) {
        if (this.currentlyOwnedPartitions.get(lease.getLeaseToken()) != null) {
            WorkerTask workerTask = this.currentlyOwnedPartitions.remove(lease.getLeaseToken());
            if (workerTask.isRunning()) {
                workerTask.interrupt();
            }
            logger.info("Partition {}: released.", lease.getLeaseToken());
        }
        // ....
    }
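One plausible reading of the snippet above is that removeLease(lease) is an ordinary method call used as the argument of .then(...), so its body (including the map removal) runs when the chain is assembled rather than when the supervisor finishes. The sketch below illustrates that eager-versus-deferred difference using java.util.concurrent.CompletableFuture as a stand-in for Reactor's Mono. The class EagerRemovalSketch and its members are hypothetical, and this is not necessarily how the linked fix addresses the problem. In Reactor itself, deferring a call like this is commonly done with Mono.defer.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class EagerRemovalSketch {

    // Hypothetical stand-in for currentlyOwnedPartitions: lease token -> task label.
    static final Map<String, String> OWNED = new ConcurrentHashMap<>();

    // The side effect happens as soon as this method is CALLED,
    // not when the returned future is later chained or observed.
    static CompletableFuture<Void> removeLease(String leaseToken) {
        OWNED.remove(leaseToken);
        return CompletableFuture.completedFuture(null);
    }

    /** Returns whether the lease is still tracked while the supervisor is running. */
    static boolean ownedWhileSupervisorRuns(boolean deferRemoval) {
        OWNED.clear();
        OWNED.put("0", "supervisor-task");

        // An incomplete future models a supervisor that is still running.
        CompletableFuture<Void> supervisor = new CompletableFuture<>();

        if (deferRemoval) {
            // The lambda body is invoked only once the supervisor completes.
            supervisor.thenCompose(ignored -> removeLease("0"));
        } else {
            // The call is evaluated now, while the chain is being built.
            CompletableFuture<Void> removal = removeLease("0");
            supervisor.thenCompose(ignored -> removal);
        }

        boolean stillOwned = OWNED.containsKey("0");
        supervisor.complete(null); // supervisor finishes; a deferred removal runs here
        return stillOwned;
    }

    public static void main(String[] args) {
        System.out.println("eager removal, tracked while running:    "
            + ownedWhileSupervisorRuns(false));
        System.out.println("deferred removal, tracked while running: "
            + ownedWhileSupervisorRuns(true));
    }
}
```

With the eager variant the entry is already gone while the supervisor is still running, which matches the behavior described in this report. With the deferred variant the entry stays until the supervisor completes and is removed afterwards.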
Expected behavior Only a single partition supervisor task should be running for a given partition at a time.
Screenshots n/a
Setup
- OS: MacOS
- IDE : IntelliJ IDEA
- Version of the Library used: azure-cosmos 3.7.4
Additional context Add any other context about the problem here.
Information Checklist Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report
- Bug Description Added
- Repro Steps Added
- Setup information Added
A fix for the issue is available as part of https://github.com/Azure/azure-sdk-for-java/pull/12999
@alxmglk - we will update this issue with the fix soon.