[BUG] Multiple partition supervisor tasks running simultaneously if leaseAcquireInterval is smaller than leaseRenewInterval

Describe the bug

PartitionControllerImpl.processPartition removes the running partition supervisor task from the PartitionControllerImpl.currentlyOwnedPartitions map by calling PartitionControllerImpl.removeLease(lease) right after the task has been started. This makes it possible to have multiple partition supervisor tasks running simultaneously if ChangeFeedProcessorOptions.leaseAcquireInterval() is smaller than ChangeFeedProcessorOptions.leaseRenewInterval().

That happens because the next execution of PartitionLoadBalancerImpl.run() is triggered before the lease renewal. Since the lease is no longer saved in the PartitionControllerImpl.currentlyOwnedPartitions map, it appears available, so we take it and start another partition supervisor task.

As a result, the ChangeFeedProcessorOptions.feedPollDelay() and ChangeFeedProcessorOptions.leaseRenewInterval() settings are not honored, and we could even end up with parallel consumption of the same change feed batch on a single instance of ChangeFeedProcessor.
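
To make the timing concrete, here is a rough arithmetic sketch rather than SDK code. It assumes (our simplification, not verified against the SDK) that the load balancer first runs at t = 0 and then every leaseAcquireInterval, that the first renewal happens one leaseRenewInterval later, and that every run before that renewal starts another task because the lease is missing from the map. The class and method names are hypothetical.

```java
import java.time.Duration;

class LeaseTimingSketch {
    // Counts load balancer runs at t = 0, a, 2a, ... that fall strictly before
    // the first renewal at t = r (assumed timeline, see the text above).
    static long runsBeforeFirstRenewal(Duration acquireInterval, Duration renewInterval) {
        long acquireMs = acquireInterval.toMillis();
        long renewMs = renewInterval.toMillis();
        // Integer ceiling of renewMs / acquireMs.
        return (renewMs + acquireMs - 1) / acquireMs;
    }

    public static void main(String[] args) {
        // Intervals from the repro below: runs at t = 0 s and t = 12 s precede the 20 s renewal.
        System.out.println(runsBeforeFirstRenewal(Duration.ofSeconds(12), Duration.ofSeconds(20))); // 2
        // With the intervals swapped, only the run at t = 0 s precedes the renewal.
        System.out.println(runsBeforeFirstRenewal(Duration.ofSeconds(20), Duration.ofSeconds(12))); // 1
    }
}
```

Under these assumptions, the 12 s / 20 s configuration gives two runs before the first renewal, which matches the two simultaneous tasks observed.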

Exception or Stack Trace

n/a

To Reproduce

  1. Build a change feed processor via ChangeFeedProcessor.Builder(), providing ChangeFeedProcessorOptions with leaseAcquireInterval smaller than leaseRenewInterval:

    ChangeFeedProcessorOptions options = new ChangeFeedProcessorOptions();

    options
        .leaseAcquireInterval(Duration.ofSeconds(12))
        .leaseRenewInterval(Duration.ofSeconds(20));

    ChangeFeedProcessor changeFeedProcessor =
        ChangeFeedProcessor.Builder()
            .hostName("my-hostname")
            .feedContainer(feedContainer)
            .leaseContainer(leaseContainer)
            .options(options)
            .handleChanges(docs -> {
                // act on docs
            })
            .build();

  2. Start the processor:

    changeFeedProcessor.start().block();

  3. After two executions of the sequence inside PartitionLoadBalancerImpl.run() there will be two partition supervisor tasks running simultaneously (that is, two parallel executions of PartitionProcessorImpl.run() and LeaseRenewerImpl.run()), so the expected cadence of executions defined by ChangeFeedProcessorOptions.feedPollDelay() and ChangeFeedProcessorOptions.leaseRenewInterval() is violated.

Code Snippet

PartitionControllerImpl.removeLease(lease) is executed right after we schedule the task in the PartitionControllerImpl.processPartition method, so the lease is removed from the PartitionControllerImpl.currentlyOwnedPartitions map right away even though the task is still running:

    private WorkerTask processPartition(PartitionSupervisor partitionSupervisor, Lease lease) {
        CancellationToken cancellationToken = this.shutdownCts.getToken();

        WorkerTask partitionSupervisorTask = new WorkerTask(lease, () -> {
            partitionSupervisor.run(cancellationToken)
                .onErrorResume(throwable -> {
                    if (throwable instanceof PartitionSplitException) {
                        PartitionSplitException ex = (PartitionSplitException) throwable;
                        return this.handleSplit(lease, ex.getLastContinuation());
                    } else if (throwable instanceof TaskCancelledException) {
                        logger.debug("Partition {}: processing canceled.", lease.getLeaseToken());
                    } else {
                        logger.warn("Partition {}: processing failed.", lease.getLeaseToken(), throwable);
                    }

                    return Mono.empty();
                })
                .then(this.removeLease(lease)).subscribe();
        });

        this.scheduler.schedule(partitionSupervisorTask);

        return partitionSupervisorTask;
    }

    private Mono<Void> removeLease(Lease lease) {
        if (this.currentlyOwnedPartitions.get(lease.getLeaseToken()) != null) {
            WorkerTask workerTask = this.currentlyOwnedPartitions.remove(lease.getLeaseToken());

            if (workerTask.isRunning()) {
                workerTask.interrupt();
            }

            logger.info("Partition {}: released.", lease.getLeaseToken());
        }

        // ....
    }
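
One plausible reading of the snippet above (our interpretation, not confirmed against the SDK or the eventual fix): the map removal sits in the body of removeLease rather than inside the Mono it returns, so it runs as soon as the expression `.then(this.removeLease(lease))` is evaluated, before the supervisor has done any work. The sketch below is a stdlib-only analogy using Runnable and Supplier instead of Reactor. All names in it are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class EagerArgumentSketch {
    static final List<String> events = new ArrayList<>();

    // Stand-in for removeLease: the side effect runs in the method body,
    // before the returned step is ever executed.
    static Runnable removeLease(String leaseToken) {
        events.add("removed " + leaseToken);
        return () -> { };
    }

    // Stand-in for a lazy pipeline step: runs `first`, then `next`.
    static Runnable then(Runnable first, Runnable next) {
        return () -> { first.run(); next.run(); };
    }

    // Deferred variant: the supplier is only invoked after `first` has run.
    static Runnable thenDeferred(Runnable first, Supplier<Runnable> next) {
        return () -> { first.run(); next.get().run(); };
    }

    static List<String> eagerOrder() {
        events.clear();
        Runnable supervisor = () -> events.add("supervisor finished");
        // removeLease("0") is evaluated here, while the pipeline is being built.
        then(supervisor, removeLease("0")).run();
        return new ArrayList<>(events);
    }

    static List<String> deferredOrder() {
        events.clear();
        Runnable supervisor = () -> events.add("supervisor finished");
        thenDeferred(supervisor, () -> removeLease("0")).run();
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(eagerOrder());    // [removed 0, supervisor finished]
        System.out.println(deferredOrder()); // [supervisor finished, removed 0]
    }
}
```

In Reactor, Mono.defer is the usual way to postpone such a call until subscription time. Whether the actual fix uses it is not something this report establishes.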

Expected behavior

Only a single partition supervisor task is running for a given partition at a time.
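
As an illustration of that invariant only, here is a minimal stdlib sketch. It is not the SDK's implementation or its fix: the class name, method names, and the use of CompletableFuture in place of WorkerTask are all assumptions made for the example. A task is registered per lease token with putIfAbsent and is unregistered only when it actually completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SingleSupervisorGuard {
    private final ConcurrentMap<String, CompletableFuture<Void>> running = new ConcurrentHashMap<>();

    // Returns true if the task was registered, false if another task
    // is already registered for the same lease token.
    boolean tryStart(String leaseToken, CompletableFuture<Void> task) {
        if (running.putIfAbsent(leaseToken, task) != null) {
            return false;
        }
        // Unregister only once the task completes, normally or with an error.
        // remove(key, value) leaves the entry alone if it was replaced meanwhile.
        task.whenComplete((ignored, error) -> running.remove(leaseToken, task));
        return true;
    }

    int runningCount() {
        return running.size();
    }

    public static void main(String[] args) {
        SingleSupervisorGuard guard = new SingleSupervisorGuard();
        CompletableFuture<Void> first = new CompletableFuture<>();
        CompletableFuture<Void> second = new CompletableFuture<>();

        System.out.println(guard.tryStart("0", first));  // true
        System.out.println(guard.tryStart("0", second)); // false, first is still running
        first.complete(null);                            // first finishes and is unregistered
        System.out.println(guard.tryStart("0", second)); // true
    }
}
```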

Screenshots

n/a

Setup

  • OS: MacOS
  • IDE : IntelliJ IDEA
  • Version of the Library used: azure-cosmos 3.7.4

Information Checklist

Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
milismsft commented, Sep 10, 2020

A fix for the issue is available as part of https://github.com/Azure/azure-sdk-for-java/pull/12999

0 reactions
kushagraThapar commented, Sep 10, 2020

@alxmglk - we will update this issue with the fix soon.
