Controller increasing memory consumption and crash
Problem description
We executed a battery of longevity runs in a small cluster (3 nodes) under a light write/read workload (mediumScale) with Pravega 0.3.3. As can be observed in the figure below, the Controller process (dotted red line) slowly consumes more memory over time.
Interestingly, these experiments do not involve managing multiple Streams or working heavily with Transactions; they are mainly IO operations, so the Controller workload should be limited.
Problem location
Controller.
Suggestions for an improvement
Profile the memory consumption of the Controller to detect a possible memory leak.
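One lightweight way to gather such data, purely illustrative and not part of Pravega (the class name and logging interval are assumptions), is to log heap usage from inside the JVM while a proper profiler or periodic heap dumps narrow down the leak:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of lightweight in-process heap monitoring. A slow but steady
 * rise of "used" heap despite GC activity is the typical signature of a leak.
 */
public class HeapLogger {
    public static void main(String[] args) {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Log heap usage once a minute (the non-daemon scheduler keeps the JVM alive).
        scheduler.scheduleAtFixedRate(() -> {
            MemoryUsage heap = memoryBean.getHeapMemoryUsage();
            System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
        }, 0, 1, TimeUnit.MINUTES);
    }
}
```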
Issue Analytics
- Created 5 years ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
Update: Curator doesn't have a problem as assumed earlier… The problem is with the ExponentialBackoffRetry policy we use for retries in the Curator client. The value we supply is 500 ms as the base and 10 retries. The logic for exponential retry in Curator is sketched below; in the worst case, the accumulated sleepMs across 10 retries would be: 500 * 2 + 500 * 4 + 500 * 8 + … + 500 * 1024 ~= 2048 * 500 ms ~= 1000 seconds. The moment we reduce our retry input parameters, Curator calls the callback method and the gRPC call completes.
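A minimal sketch of that worst-case arithmetic, assuming an exponential back-off that sleeps roughly baseSleepTimeMs * random(1 .. 2^(retryCount + 1)) per attempt; the class name and loop are illustrative, not Curator source:

```java
import java.util.Random;

/**
 * Illustrative worst-case calculation for an ExponentialBackoffRetry-style policy
 * configured with a 500 ms base sleep and 10 retries.
 */
public class BackoffWorstCase {
    public static void main(String[] args) {
        long baseSleepTimeMs = 500;   // base supplied by the Controller
        int maxRetries = 10;          // retries supplied by the Controller
        Random random = new Random();

        long worstCaseTotalMs = 0;
        long randomizedTotalMs = 0;
        for (int retryCount = 0; retryCount < maxRetries; retryCount++) {
            // Upper bound for this attempt's sleep: base * 2^(retryCount + 1)
            long worstSleepMs = baseSleepTimeMs * (1L << (retryCount + 1));
            // Randomized sleep in [base, worstSleepMs), mirroring the back-off idea
            long sleepMs = baseSleepTimeMs * Math.max(1, random.nextInt(1 << (retryCount + 1)));
            worstCaseTotalMs += worstSleepMs;
            randomizedTotalMs += sleepMs;
        }
        // Worst case: 500 * (2 + 4 + ... + 1024) = 500 * 2046 ms ~= 1023 s,
        // i.e. roughly the 1000 seconds quoted above.
        System.out.println("Worst-case total back-off: " + worstCaseTotalMs + " ms");
        System.out.println("One randomized sample:     " + randomizedTotalMs + " ms");
    }
}
```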
The reason the Controller service takes a very long time to shut down is the following: we use grpc.shutdown, which waits for all ongoing gRPC requests to complete before shutting down the gRPC service. There is an ongoing gRPC call, in this case updateStream, which hasn't completed. The reason for its failure to complete lies with Curator, though. The following is the pattern we use for making ZK calls in the store (a sketch is shown below):
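A minimal sketch of such an asynchronous Curator call, completing a CompletableFuture from a background callback; the class and method names are illustrative assumptions, not the actual Pravega store code:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

public class ZkReadExample {
    /**
     * Reads data at the given path using Curator's background (async) API and
     * completes a CompletableFuture from the background callback. If the callback
     * is never invoked (e.g. after a ZK session expiry), the future never completes
     * and the caller's gRPC call stays pending.
     */
    public static CompletableFuture<byte[]> getData(CuratorFramework client, String path) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        try {
            client.getData()
                  .inBackground((cli, event) -> {
                      if (event.getResultCode() == KeeperException.Code.OK.intValue()) {
                          result.complete(event.getData());
                      } else {
                          result.completeExceptionally(KeeperException.create(
                                  KeeperException.Code.get(event.getResultCode()), path));
                      }
                  })
                  .forPath(path);
        } catch (Exception e) {
            result.completeExceptionally(e);
        }
        return result;
    }
}
```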
If the ZK session has expired, Curator sends an interrupt to the background work; this does not result in the callback being invoked.
If I replace the asynchronous Curator client call with a synchronous call, Curator immediately throws an IllegalStateException, our future completes, and gRPC returns the failure to the caller.
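For contrast, a synchronous variant of the same illustrative sketch fails fast instead of hanging; again, this is an assumption-laden example, not the Pravega store code:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.curator.framework.CuratorFramework;

public class ZkReadSyncExample {
    /**
     * Synchronous variant of the read above: the foreground Curator call fails
     * fast (for example with an IllegalStateException when the client can no
     * longer be used), so the future completes exceptionally and the gRPC call
     * can return the failure to the caller instead of hanging.
     */
    public static CompletableFuture<byte[]> getDataSync(CuratorFramework client, String path) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        try {
            result.complete(client.getData().forPath(path));
        } catch (Exception e) {
            result.completeExceptionally(e);
        }
        return result;
    }
}
```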