Apparent thread leak causing OOME in tservers


Describe the bug

While testing 2.1 in AWS we’ve observed a consistent pattern of OOMEs resulting in dead tservers. The OOME occurs relatively quickly when the tservers are under sufficient query load, but it still seems to occur under any amount of load given enough time.

The OOMEs present this stack trace pretty consistently:

[rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Unexpected throwable while invoking!
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.lang.Thread.start0(Native Method) ~[?:?]
        at java.lang.Thread.start(Thread.java:798) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:346) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]
        at org.apache.accumulo.core.util.threads.ThreadPools$3.schedule(ThreadPools.java:529) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.session.SessionManager.removeIfNotAccessed(SessionManager.java:283) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.ThriftClientHandler.continueMultiScan(ThriftClientHandler.java:581) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.tserver.ThriftClientHandler.startMultiScan(ThriftClientHandler.java:532) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        at org.apache.accumulo.core.trace.TraceUtil.lambda$wrapService$1(TraceUtil.java:221) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at com.sun.proxy.$Proxy35.startMultiScan(Unknown Source) ~[?:?]
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3038) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3017) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.15.0.jar:0.15.0]
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.15.0.jar:0.15.0]
        at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:54) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.15.0.jar:0.15.0]
        at org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke(CustomNonBlockingServer.java:129) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.15.0.jar:0.15.0]
        at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]

Versions (OS, Maven, Java, and others, as appropriate):

  • Affected version(s) of this project: 2.1-SNAPSHOT. So far, I’ve been able to replicate the issue on the following commits: 918bb92, 2ca070b, 4b66b96, 9451dd0
  • OS: CentOS 7.5
  • Others: Hadoop 3.3.1, ZK 3.5.9, Java 11, Maven 3.6.3

To Reproduce

  1. Put Accumulo under reasonably heavy query load and observe thread counts steadily increasing in the tserver JVMs until an OOME occurs

Expected behavior

No OOME

Additional context

We appear to be getting lots of TimeoutExceptions thrown in ThriftClientHandler because the hardcoded 1-second timeout is being hit: https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L581

Timeout duration is defined here: https://github.com/apache/accumulo/blob/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L168
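To make the timeout behavior concrete, here is a minimal sketch of a bounded wait on a pending result. It assumes the handler waits on something like a `Future` with a time limit, which may not match the real Accumulo code; `BoundedWaitSketch`, `pollResult`, and `MAX_WAIT_MILLIS` are hypothetical names used only for illustration.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {

  // Hypothetical stand-in for the hardcoded 1-second wait described above.
  static final long MAX_WAIT_MILLIS = 1000L;

  // Waits up to waitMillis for the result. An empty Optional means the wait
  // timed out while the underlying work is still pending.
  static <T> Optional<T> pollResult(Future<T> pending, long waitMillis)
      throws InterruptedException, ExecutionException {
    try {
      return Optional.of(pending.get(waitMillis, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      return Optional.empty();
    }
  }

  public static void main(String[] args) throws Exception {
    // A future that is never completed models a scan that outlasts the wait.
    CompletableFuture<String> slowScan = new CompletableFuture<>();
    System.out.println("slow scan returned in time: " + pollResult(slowScan, 50).isPresent());

    // An already completed future models a scan that finishes within the wait.
    CompletableFuture<String> fastScan = CompletableFuture.completedFuture("batch-1");
    System.out.println("fast scan result: " + pollResult(fastScan, MAX_WAIT_MILLIS).get());
  }
}
```

Under heavy load, more calls would take the timeout branch, which is the path that appears to lead into the cleanup scheduling discussed in this issue.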

As a result, we seem to get tons of new threads spun up in SessionManager.removeIfNotAccessed: https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L283

…and those threads seem to linger in the JVM indefinitely. In jstacks that I’ve captured on tservers just before they die, there are typically 30,000+ threads in the JVM when the OOME is about to strike. The time it takes to hit the OOME varies with the query load we’re putting on Accumulo, but all our tservers seem to die this way eventually, given enough time.
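One way to confirm which pool is leaking, as an alternative to reading a 30,000-thread jstack by hand, is to group live thread names by pattern. The sketch below is a generic JVM diagnostic, not something taken from the issue; `ThreadGroupCountSketch` and its methods are hypothetical names.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ThreadGroupCountSketch {

  // Collapses digit runs so "pool-12-thread-1" and "pool-13-thread-1" share one key.
  static String normalize(String threadName) {
    return threadName.replaceAll("\\d+", "#");
  }

  // Counts how many thread names fall under each normalized pattern.
  static Map<String, Long> countByPattern(Collection<String> threadNames) {
    return threadNames.stream()
        .collect(Collectors.groupingBy(
            ThreadGroupCountSketch::normalize, TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) {
    // Snapshot the names of all live threads in this JVM.
    Collection<String> live = Thread.getAllStackTraces().keySet().stream()
        .map(Thread::getName)
        .collect(Collectors.toList());
    countByPattern(live).forEach((pattern, n) -> System.out.println(n + "\t" + pattern));
  }
}
```

A pattern whose count keeps growing between snapshots is a likely candidate for the leaking pool.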

FWIW, I’m currently running Accumulo with ThriftClientHandler.MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS hardcoded to 60 seconds, and that seems to resolve this issue entirely. I know that’s not the ideal solution, though. A configurable timeout would be better, or perhaps there’s more going on here than meets the eye.
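As a rough illustration of what a configurable timeout could look like, the sketch below reads the wait from a property and falls back to the current 1-second value. The property name `example.scan.result.wait.millis` and the class `ScanTimeoutSketch` are invented for this example; Accumulo has its own configuration system, and this does not show how such a setting is or would be wired into it.

```java
import java.util.Properties;

public class ScanTimeoutSketch {

  // Matches the 1-second default described in this issue.
  static final long DEFAULT_WAIT_MILLIS = 1000L;

  // Hypothetical property name, used only for this illustration.
  static final String WAIT_PROPERTY = "example.scan.result.wait.millis";

  // Returns the configured wait, or the default when the value is
  // missing, non-numeric, or not positive.
  static long resolveWaitMillis(Properties props) {
    String raw = props.getProperty(WAIT_PROPERTY);
    if (raw == null) {
      return DEFAULT_WAIT_MILLIS;
    }
    try {
      long value = Long.parseLong(raw.trim());
      return value > 0 ? value : DEFAULT_WAIT_MILLIS;
    } catch (NumberFormatException e) {
      return DEFAULT_WAIT_MILLIS;
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println("unset: " + resolveWaitMillis(props));
    props.setProperty(WAIT_PROPERTY, "60000");
    System.out.println("60 seconds: " + resolveWaitMillis(props));
  }
}
```

Falling back to the default on bad input keeps a typo in the configuration from disabling the wait altogether.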

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

dlmarion commented, Mar 29, 2022

I think this should be calling ServerContext.getScheduledExecutor() instead of ThreadPools.getServerThreadPools().createGeneralScheduledExecutorService() so that it adds a Runnable to the shared general ScheduledThreadPoolExecutor instead of creating a new ThreadPoolExecutor.

This is in #2593
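The difference between the two approaches can be sketched with plain `java.util.concurrent` classes. This is not the Accumulo code: `SharedSchedulerSketch`, `scheduleWithNewPool`, and `scheduleOnShared` are hypothetical names, and the shared pool size of 2 is an arbitrary choice for the demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SharedSchedulerSketch {

  // Leaky pattern: each call builds a new executor that nobody shuts down,
  // so each call leaves an idle core worker thread behind.
  static ScheduledThreadPoolExecutor scheduleWithNewPool(Runnable task, long delayMillis) {
    ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
    pool.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
    return pool; // returned only so this demo can count and clean up
  }

  // Shared pattern: one long-lived executor handles every scheduled task.
  static final ScheduledThreadPoolExecutor SHARED = new ScheduledThreadPoolExecutor(2);

  static void scheduleOnShared(Runnable task, long delayMillis) {
    SHARED.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) {
    Runnable noOp = () -> { };

    List<ScheduledThreadPoolExecutor> leaked = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      leaked.add(scheduleWithNewPool(noOp, 10));
    }
    int leakedThreads = leaked.stream().mapToInt(ScheduledThreadPoolExecutor::getPoolSize).sum();
    System.out.println("threads from per-call pools: " + leakedThreads);

    for (int i = 0; i < 100; i++) {
      scheduleOnShared(noOp, 10);
    }
    // The shared pool never grows beyond its core size, however many tasks arrive.
    System.out.println("threads in shared pool: " + SHARED.getPoolSize());

    // Clean up so the demo JVM can exit.
    leaked.forEach(ScheduledThreadPoolExecutor::shutdownNow);
    SHARED.shutdownNow();
  }
}
```

With per-call pools the thread count grows with the number of calls, which is consistent with the steadily rising thread counts reported above.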

dlmarion commented, Mar 30, 2022

I don’t see a reason why we shouldn’t make the timeout configurable.
