
Direct memory leaks in ledger rollover or when Bookkeeper is overloaded

See original GitHub issue

Describe the bug

The broker (tested with 2.6.3, 2.7.2 and the master branch) runs out of direct memory under certain conditions. The Netty leak detector reports leaks (enabled with -Dpulsar.allocator.leak_detection=Advanced -Dio.netty.leakDetectionLevel=advanced).

11:53:26.531 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:26.538 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:26.544 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.894 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.898 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.901 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.902 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.903 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.904 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:51.905 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
11:53:56.585 [pulsar-io-24-2] ERROR org.apache.pulsar.PulsarBrokerStarter - -- Shutting down - Received OOM exception: failed to allocate 16777216 byte(s) of direct memory (used: 1056964615, max: 1073741824)
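The "LEAK: ByteBuf.release() was not called" messages mean a reference-counted buffer was garbage-collected while its reference count was still above zero. The following is a minimal sketch of that reference-counting contract using a simplified stand-in class, not Netty's actual ByteBuf API:

```python
# Simplified model (NOT Netty's API) of the reference-counting contract that
# Netty's ByteBuf follows: every buffer starts at refCnt == 1 and must be
# balanced by release(); a buffer discarded while refCnt > 0 is exactly the
# "ByteBuf.release() was not called" condition the leak detector reports.

class RefCountedBuf:
    def __init__(self):
        self.ref_cnt = 1
        self.deallocated = False

    def retain(self):
        self.ref_cnt += 1
        return self

    def release(self):
        if self.ref_cnt <= 0:
            raise RuntimeError("over-released")
        self.ref_cnt -= 1
        if self.ref_cnt == 0:
            # real code would return the direct memory to the pool here
            self.deallocated = True
        return self.deallocated


def process(buf):
    # Correct pattern: release on every code path, including error paths.
    try:
        pass  # ... use buf ...
    finally:
        buf.release()


ok = RefCountedBuf()
process(ok)
print(ok.ref_cnt)   # 0: released, no leak

bad = RefCountedBuf()
# ... an error path drops the reference without calling release() ...
print(bad.ref_cnt)  # 1: discarded while still referenced -- this is the leak
```

The bug described here is an execution path in the broker that behaves like the second case: under rollover or overload pressure, buffers are dropped without their matching release().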


To Reproduce

Steps to reproduce the behavior:

In first terminal window (leave terminal open):

# checkout repro repo
git clone https://github.com/lhotari/pulsar-issue-6054-repro
cd pulsar-issue-6054-repro

# deployment to minikube (or similar local environment) with enough RAM (8GB?)
helm repo add apache https://pulsar.apache.org/charts
helm repo update
./helm-deployment/redeploy_1node_cluster.sh

# forward ports 8080 and 6650 to deployed environment
# wait until deployment is completed, then run
./helm-deployment/pulsar_port_forward.sh

In each of the following terminal windows, configure pulsar-admin / pulsar-perf connectivity to the deployed environment:

# set pulsar-client / pulsar-admin / pulsar-perf settings
export PULSAR_CLIENT_CONF=$PWD/helm-deployment/client.conf

# before continuing, ensure that pulsar-admin and pulsar-perf are on the PATH
# for example
export PATH="$HOME/apache-pulsar-2.7.2/bin:$PATH"

Deploy the functions used in the repro:

./deploy_function_chain.sh

Open two terminal windows with PULSAR_CLIENT_CONF set and pulsar-perf on the PATH.

Run producer:

./produce.sh

Run consumer:

./consume.sh

In one terminal window, grep the broker logs for leak and OOM messages:

kubectl -n pulsar-testenv logs -f pod/pulsar-testenv-deployment-broker-0 |egrep 'LEAK|OOM'

Expected behavior

Direct memory shouldn’t leak, even when the system is under load; Netty buffers should be released before being discarded.

Additional context

The broker is intentionally using special configuration that makes it easier to reproduce the issue.

broker:
  configData:
    backlogQuotaDefaultRetentionPolicy: producer_exception
    bookkeeperEnableStickyReads: 'true'
    brokerDeduplicationEnabled: 'false'
    brokerDeleteInactiveTopicsEnabled: 'false'
    exposeConsumerLevelMetricsInPrometheus: 'false'
    exposeTopicLevelMetricsInPrometheus: 'true'
    managedLedgerAddEntryTimeoutSeconds: '120'
    managedLedgerMaxEntriesPerLedger: '10'
    managedLedgerMaxLedgerRolloverTimeMinutes: '2'
    managedLedgerMinLedgerRolloverTimeMinutes: '0'
    managedLedgerReadEntryTimeoutSeconds: '120'
    topicFencingTimeoutSeconds: '5'
    maxUnackedMessagesPerConsumer: '0'
    maxUnackedMessagesPerSubscription: '0'   

The main configuration setting that triggers the problem is managedLedgerMaxEntriesPerLedger=10. This is not a value to use in production; the current default for managedLedgerMaxEntriesPerLedger is 50000. However, a very low value stress-tests the ledger rollover functionality in the broker, which reveals an execution path where Netty buffers are discarded without being released. That shouldn’t happen even when the broker is overloaded, and the expectation is to fix that issue first.
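A rough back-of-the-envelope calculation (with an assumed publish rate, not one measured from the repro) shows why the low entry cap stresses rollover:

```python
# Illustration with assumed numbers: each managed ledger holds at most
# managedLedgerMaxEntriesPerLedger entries, so the rollover frequency is
# roughly the entry rate divided by that cap.

def rollovers_per_second(entries_per_second, max_entries_per_ledger):
    return entries_per_second / max_entries_per_ledger

rate = 10_000  # hypothetical publish rate in entries/sec
print(rollovers_per_second(rate, 10))      # 1000.0 rollovers/sec with the repro setting
print(rollovers_per_second(rate, 50_000))  # 0.2 rollovers/sec with the default
```

At the repro setting the broker performs rollovers thousands of times more often than with the default, which is what exercises the leaking code path so quickly.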

This test scenario also exposes another problem. Sometimes the broker gets into a state where direct memory doesn’t run out and instead the random1-in topic “locks up”. stats-internal shows a state of LedgerClosing for some partitions, and producing is blocked in this state. This issue might be related to #6054.

The direct memory leak issue might be related to #9562 or #10136.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
lhotari commented, May 31, 2021

This is the behavior with the fixes in #10755


Direct memory no longer grows out of bounds.

0 reactions
lhotari commented, May 31, 2021

I added one more commit to #10755, f176c15631959456d5edfe9526766dadb199e46a. It seems to fix the memory leak, at least in the repro case. However, consuming message throughput drops very low for some reason; that must be a different problem.

