Apparent thread leak causing OOME in tservers
Describe the bug
While testing 2.1 in AWS we’ve observed a consistent pattern of OOMEs resulting in dead tservers. The OOME occurs relatively quickly when the tservers are under sufficient query load, but still seems to occur under any amount of load given enough time.
The OOMEs present this stack trace pretty consistently:
[rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Unexpected throwable while invoking!
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.lang.Thread.start0(Native Method) ~[?:?]
at java.lang.Thread.start(Thread.java:798) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:346) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]
at org.apache.accumulo.core.util.threads.ThreadPools$3.schedule(ThreadPools.java:529) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.tserver.session.SessionManager.removeIfNotAccessed(SessionManager.java:283) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.tserver.ThriftClientHandler.continueMultiScan(ThriftClientHandler.java:581) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.tserver.ThriftClientHandler.startMultiScan(ThriftClientHandler.java:532) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.accumulo.core.trace.TraceUtil.lambda$wrapService$1(TraceUtil.java:221) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at com.sun.proxy.$Proxy35.startMultiScan(Unknown Source) ~[?:?]
at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3038) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3017) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:54) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke(CustomNonBlockingServer.java:129) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Versions (OS, Maven, Java, and others, as appropriate):
- Affected version(s) of this project: 2.1-SNAPSHOT. So far, I’ve been able to replicate the issue on the following commits: 918bb92, 2ca070b, 4b66b96, 9451dd0
- OS: CentOS 7.5
- Others: Hadoop 3.3.1, ZK 3.5.9, Java 11, Maven 3.6.3
To Reproduce
- Put Accumulo under reasonably heavy query load, and observe thread counts steadily increasing in tserver JVMs until the OOME occurs (one possible load-generator sketch follows below)
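For reference, a minimal sketch of the kind of multi-scan query load that reproduces this for us. The instance name, ZooKeeper host, credentials, table name, and range count are placeholders; any sustained BatchScanner workload that keeps hitting startMultiScan/continueMultiScan should do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class MultiScanLoad {
  public static void main(String[] args) throws Exception {
    try (AccumuloClient client = Accumulo.newClient()
        .to("myInstance", "zkhost:2181").as("user", "pass").build()) {
      // A batch of ranges per scan; each batch scan fans out to multi-scans on the tservers.
      List<Range> ranges = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        ranges.add(Range.prefix(String.format("row%04d", i)));
      }
      // Loop forever; tserver thread counts climb while this runs.
      while (true) {
        try (BatchScanner scanner =
            client.createBatchScanner("testtable", Authorizations.EMPTY, 16)) {
          scanner.setRanges(ranges);
          long count = 0;
          for (Entry<Key,Value> e : scanner) {
            count++;
          }
          System.out.println("scanned " + count + " entries");
        }
      }
    }
  }
}
```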
Expected behavior
No OOME.
Additional context
What appears to be happening is that lots of TimeoutExceptions are being thrown in ThriftClientHandler due to the hardcoded 1-second timeout being hit:
https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L581
The timeout duration is defined here: https://github.com/apache/accumulo/blob/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L168
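To make the mechanism concrete, here is a standalone sketch (not Accumulo’s actual code; the class and task names are made up, and only the 1-second wait mirrors the hardcoded constant) of a handler waiting a bounded time on a scan task’s Future and hitting TimeoutException when the scan runs longer:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ScanWaitSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Simulate a scan batch that takes longer than the handler is willing to wait.
    Future<String> scanTask = pool.submit(() -> {
      Thread.sleep(5_000);
      return "scan batch";
    });

    long waitMillis = 1_000; // mirrors the 1s constant; with 60_000 the timeout rarely fires
    try {
      System.out.println("got: " + scanTask.get(waitMillis, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      // Under load this branch fires on nearly every continueMultiScan call, and it is
      // on this path that SessionManager.removeIfNotAccessed ends up being invoked.
      System.out.println("timed out after " + waitMillis + " ms");
    }
    pool.shutdownNow();
  }
}
```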
As a result, many new threads get spun up in SessionManager.removeIfNotAccessed:
https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L283
…and those threads seem to linger in the JVM indefinitely. From jstacks I’ve captured on tservers just before they die, there are typically 30,000+ threads in the JVM when the OOME is about to strike. The amount of time it takes to hit the OOME varies with the query load we’re putting on Accumulo, but all our tservers seem to die this way eventually, given enough time.
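For illustration only, a generic reproduction of that kind of leak (this is not Accumulo’s code; it just shows why creating a scheduler per call and never shutting it down pins threads until the process can no longer create new ones):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerLeakDemo {
  public static void main(String[] args) {
    for (int i = 0; i < 10_000; i++) {
      // Leaky pattern: a fresh executor per call, with no shutdown() anywhere.
      ScheduledExecutorService ses = Executors.newScheduledThreadPool(1);
      ses.schedule(() -> { /* pretend to clean up an idle session */ },
          60, TimeUnit.SECONDS);
    }
    // Core pool threads never time out by default, so every executor above keeps its
    // worker thread alive; run long enough, this produces the same
    // "unable to create native thread" OutOfMemoryError seen in the tservers.
    System.out.println("live threads: " + Thread.activeCount());
  }
}
```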
FWIW, I’m currently running Accumulo with ThriftClientHandler.MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS hardcoded to 60 seconds, and that seems to resolve this issue entirely. That’s not the ideal solution here, I know; a configurable timeout would be better, or perhaps there’s more going on here than meets the eye.
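Purely as a sketch of what “configurable” could look like (the property name tserver.scan.results.max.timeout is an assumption for illustration, not an existing or agreed-upon property, and the real plumbing would go through Accumulo’s configuration classes rather than a raw Properties object):

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class ConfigurableScanTimeout {
  // Replacement for the hardcoded MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS constant:
  // read the wait time from site configuration, defaulting to the current 1 second.
  static long scanResultWaitMillis(Properties siteConfig) {
    String value = siteConfig.getProperty("tserver.scan.results.max.timeout", "1000");
    return Long.parseLong(value); // plain millis here to keep the sketch self-contained
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("tserver.scan.results.max.timeout",
        Long.toString(TimeUnit.SECONDS.toMillis(60)));
    System.out.println("scan result wait: " + scanResultWaitMillis(conf) + " ms");
  }
}
```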
This is in #2593
I don’t see a reason why we shouldn’t make the timeout configurable.