Apparent thread leak causing OOME in tservers
Describe the bug
While testing 2.1 in AWS we’ve observed a consistent pattern of OOMEs resulting in dead tservers. The OOME occurs relatively quickly when the tservers are under sufficient query load, but still seems to occur under any amount of load given enough time.
The OOMEs present this stack trace pretty consistently:
[rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Unexpected throwable while invoking!
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.lang.Thread.start0(Native Method) ~[?:?]
at java.lang.Thread.start(Thread.java:798) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:346) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]
at org.apache.accumulo.core.util.threads.ThreadPools$3.schedule(ThreadPools.java:529) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.tserver.session.SessionManager.removeIfNotAccessed(SessionManager.java:283) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.tserver.ThriftClientHandler.continueMultiScan(ThriftClientHandler.java:581) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.tserver.ThriftClientHandler.startMultiScan(ThriftClientHandler.java:532) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.accumulo.core.trace.TraceUtil.lambda$wrapService$1(TraceUtil.java:221) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at com.sun.proxy.$Proxy35.startMultiScan(Unknown Source) ~[?:?]
at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3038) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3017) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:54) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke(CustomNonBlockingServer.java:129) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.15.0.jar:0.15.0]
at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Versions (OS, Maven, Java, and others, as appropriate):
- Affected version(s) of this project: 2.1-SNAPSHOT. So far, I’ve been able to replicate the issue on the following commits: 918bb92, 2ca070b, 4b66b96, 9451dd0
- OS: CentOS 7.5
- Others: Hadoop 3.3.1, ZK 3.5.9, Java 11, Maven 3.6.3
To Reproduce
- Put Accumulo under reasonably heavy query load, and observe thread counts steadily increasing in tserver JVMs until the OOME occurs (one possible load-generator sketch follows below)
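For reference, a minimal sketch of the kind of multi-scan query load that reproduces this for us. The instance name, ZooKeeper host, credentials, table name, and range count are placeholders; any sustained BatchScanner workload that keeps hitting startMultiScan/continueMultiScan should do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class MultiScanLoad {
  public static void main(String[] args) throws Exception {
    try (AccumuloClient client = Accumulo.newClient()
        .to("myInstance", "zkhost:2181").as("user", "pass").build()) {
      // A batch of ranges per scan; each batch scan fans out to multi-scans on the tservers.
      List<Range> ranges = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        ranges.add(Range.prefix(String.format("row%04d", i)));
      }
      // Loop forever; tserver thread counts climb while this runs.
      while (true) {
        try (BatchScanner scanner =
            client.createBatchScanner("testtable", Authorizations.EMPTY, 16)) {
          scanner.setRanges(ranges);
          long count = 0;
          for (Entry<Key,Value> e : scanner) {
            count++;
          }
          System.out.println("scanned " + count + " entries");
        }
      }
    }
  }
}
```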
Expected behavior
No OOME.
Additional context
What appears to be happening is that lots of TimeoutExceptions are being thrown in ThriftClientHandler due to the hardcoded 1-second timeout being hit:
https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L581
The timeout duration is defined here: https://github.com/apache/accumulo/blob/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L168
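To make the mechanism concrete, here is a standalone sketch (not Accumulo’s actual code; the class and task names are made up, and only the 1-second wait mirrors the hardcoded constant) of a handler waiting a bounded time on a scan task’s Future and hitting TimeoutException when the scan runs longer:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ScanWaitSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Simulate a scan batch that takes longer than the handler is willing to wait.
    Future<String> scanTask = pool.submit(() -> {
      Thread.sleep(5_000);
      return "scan batch";
    });

    long waitMillis = 1_000; // mirrors the 1s constant; with 60_000 the timeout rarely fires
    try {
      System.out.println("got: " + scanTask.get(waitMillis, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      // Under load this branch fires on nearly every continueMultiScan call, and it is
      // on this path that SessionManager.removeIfNotAccessed ends up being invoked.
      System.out.println("timed out after " + waitMillis + " ms");
    }
    pool.shutdownNow();
  }
}
```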
As a result, many new threads get spun up in SessionManager.removeIfNotAccessed:
https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L283
…and those threads seem to linger in the JVM indefinitely. From jstacks I’ve captured on tservers just before they die, there are typically 30,000+ threads in the JVM when the OOME is about to strike. The amount of time it takes to hit the OOME varies with the query load we’re putting on Accumulo, but all our tservers seem to die this way eventually, given enough time.
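For illustration only, a generic reproduction of that kind of leak (this is not Accumulo’s code; it just shows why creating a scheduler per call and never shutting it down pins threads until the process can no longer create new ones):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerLeakDemo {
  public static void main(String[] args) {
    for (int i = 0; i < 10_000; i++) {
      // Leaky pattern: a fresh executor per call, with no shutdown() anywhere.
      ScheduledExecutorService ses = Executors.newScheduledThreadPool(1);
      ses.schedule(() -> { /* pretend to clean up an idle session */ },
          60, TimeUnit.SECONDS);
    }
    // Core pool threads never time out by default, so every executor above keeps its
    // worker thread alive; run long enough, this produces the same
    // "unable to create native thread" OutOfMemoryError seen in the tservers.
    System.out.println("live threads: " + Thread.activeCount());
  }
}
```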
FWIW, I’m currently running Accumulo with ThriftClientHandler.MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS hardcoded to 60 seconds, and that seems to resolve this issue entirely. That’s not the ideal solution here, I know; a configurable timeout would be better, or perhaps there’s more going on here than meets the eye.
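Purely as a sketch of what “configurable” could look like (the property name tserver.scan.results.max.timeout is an assumption for illustration, not an existing or agreed-upon property, and the real plumbing would go through Accumulo’s configuration classes rather than a raw Properties object):

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class ConfigurableScanTimeout {
  // Replacement for the hardcoded MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS constant:
  // read the wait time from site configuration, defaulting to the current 1 second.
  static long scanResultWaitMillis(Properties siteConfig) {
    String value = siteConfig.getProperty("tserver.scan.results.max.timeout", "1000");
    return Long.parseLong(value); // plain millis here to keep the sketch self-contained
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("tserver.scan.results.max.timeout",
        Long.toString(TimeUnit.SECONDS.toMillis(60)));
    System.out.println("scan result wait: " + scanResultWaitMillis(conf) + " ms");
  }
}
```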
This is in #2593
I don’t see a reason why we shouldn’t make the timeout configurable.