Running stress master bench with the CreateDir operation reports an exception
Alluxio Version: latest master
Describe the bug: Running the stress master bench fails and reports the following:
"Failed after 1 attempts: alluxio.exception.status.CancelledException: Thread interrupted", "Failed after 1 attempts: alluxio.exception.status.CancelledException: Thread interrupted"
The master log is as follows:
2022-03-25 16:55:21,557 INFO SegmentedRaftLogWorker - alluxio-lucas-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /var/alluxio-share-dir/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_442711
2022-03-25 16:55:21,984 WARN ForkJoinPoolHelper - Failed to compensate rpc pool. Consider increasing thread pool size.
java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
at alluxio.concurrent.jsr.ForkJoinPool.tryCompensate(ForkJoinPool.java:1320)
at alluxio.concurrent.jsr.ForkJoinPool.managedBlock(ForkJoinPool.java:1002)
at alluxio.concurrent.ForkJoinPoolHelper.safeManagedBlock(ForkJoinPoolHelper.java:41)
at alluxio.master.journal.AsyncJournalWriter$FlushTicket.waitCompleted(AsyncJournalWriter.java:96)
at alluxio.master.journal.AsyncJournalWriter.flush(AsyncJournalWriter.java:381)
at alluxio.master.journal.MasterJournalContext.waitForJournalFlush(MasterJournalContext.java:78)
at alluxio.master.journal.MasterJournalContext.close(MasterJournalContext.java:107)
at alluxio.master.journal.StateChangeJournalContext.close(StateChangeJournalContext.java:53)
at alluxio.master.file.RpcContext.closeQuietly(RpcContext.java:146)
at alluxio.master.file.RpcContext.close(RpcContext.java:134)
at alluxio.master.file.DefaultFileSystemMaster.createDirectory(DefaultFileSystemMaster.java:2242)
at alluxio.master.file.FileSystemMasterClientServiceHandler.lambda$createDirectory$3(FileSystemMasterClientServiceHandler.java:171)
at alluxio.RpcUtils.callAndReturn(RpcUtils.java:121)
at alluxio.RpcUtils.call(RpcUtils.java:83)
at alluxio.RpcUtils.call(RpcUtils.java:58)
at alluxio.master.file.FileSystemMasterClientServiceHandler.createDirectory(FileSystemMasterClientServiceHandler.java:169)
at alluxio.grpc.FileSystemMasterClientServiceGrpc$MethodHandlers.invoke(FileSystemMasterClientServiceGrpc.java:2368)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at alluxio.security.authentication.ClientIpAddressInjector$1.onHalfClose(ClientIpAddressInjector.java:57)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at alluxio.security.authentication.AuthenticatedUserInjector$1.onHalfClose(AuthenticatedUserInjector.java:67)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at alluxio.concurrent.jsr.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1378)
at alluxio.concurrent.jsr.ForkJoinTask.doExec(ForkJoinTask.java:609)
at alluxio.concurrent.jsr.ForkJoinPool.runWorker(ForkJoinPool.java:1356)
at alluxio.concurrent.jsr.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:131)
2022-03-25 16:55:22,951 INFO SegmentedRaftLogWorker - alluxio-lucas-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolling segment log-442711_442838 to index:442838
2022-03-25 16:55:22,951 INFO SegmentedRaftLogWorker - alluxio-lucas-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolled log segment from /var/alluxio-share-dir/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_442711 to /var/alluxio-share-dir/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_442711-442838
job_worker.log is as follows:
2022-03-25 15:05:20,849 WARN meta.MetaMasterConfigClient (AbstractClient.java:retryRPC) - GetConfigHash() returned clusterConfigHash: "c723a97eaf6f3620a720daa99c3dfed8"
pathConfigHash: "d41d8cd98f00b204e9800998ecf8427e"
in 12212 ms (>=10000 ms)
2022-03-25 15:16:19,908 INFO task.TaskExecutorManager (TaskExecutorManager.java:notifyTaskCompletion) - Task 26 for job 1648191157925 completed.
2022-03-25 16:35:38,100 INFO command.CommandHandlingExecutor (CommandHandlingExecutor.java:run) - Received run task 71 for job 1648191157926 on worker 1648191147994
2022-03-25 16:35:38,100 INFO task.TaskExecutorManager (TaskExecutorManager.java:executeTask) - Task 71 for job 1648191157926 received
2022-03-25 16:35:38,101 INFO task.TaskExecutorManager (TaskExecutorManager.java:notifyTaskRunning) - Task 71 for job 1648191157926 started
2022-03-25 16:55:48,506 INFO task.TaskExecutorManager (TaskExecutorManager.java:notifyTaskCompletion) - Task 71 for job 1648191157926 completed.
Top GitHub Comments
Thanks @HelloHorizon @dbw9580. I found the problem. First, this test case runs for a long time, so --bench-timeout should be set to more than 200m. Second, the JVM options should give the young generation a large amount of memory; otherwise GC will kill the container.
jvmOptions:
- "-server"
- "-Xmx96g"
- "-Xms96g"
- "-Xmn10g"
- "-XX:SurvivorRatio=6"
- "-XX:ParallelGCThreads=40"
- "-XX:MaxDirectMemorySize=4g"
- "-XX:+UseG1GC"
- "-XX:G1HeapRegionSize=32m"
- "-XX:MaxGCPauseMillis=200"
- "-XX:MetaspaceSize=2g"
- "-XX:MaxMetaspaceSize=2g"
- "-XX:+DisableExplicitGC"
- "-XX:MaxTenuringThreshold=15"
- "-Xloggc:/var/alluxio-share-dir/gc_%p.log.log"
- "-verbose:gc"
- "-XX:CMSInitiatingOccupancyFraction=70"
- "-XX:+PrintGCDetails"
- "-XX:ErrorFile=/var/alluxio-share-dir/java_error_%p.log"
- "-XX:HeapDumpPath=/var/alluxio-share-dir/core_%p.log"
- "-XX:+PrintClassHistogram"
- "-XX:+PrintGCDetails"
@lucaspeng12138 Since you see that threads are interrupted, it's probably because of --bench-timeout, which defaults to 20 min and then kills the test threads. Try setting it to a larger value: https://github.com/Alluxio/alluxio/blob/21560be53f4306ddc077a7be9af8e5c18f255a1c/stress/shell/src/main/java/alluxio/stress/cli/StressMasterBench.java#L292
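
To illustrate why an expired timeout would surface as "Thread interrupted", here is a rough standalone sketch of the general pattern. It is not the actual StressMasterBench code: the class name BenchTimeoutSketch, the method runWithTimeout, and the sleep that stands in for a long series of CreateDir calls are all made up for illustration. The assumption is simply that the bench waits for its worker threads up to a timeout and interrupts whatever is still running.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Alluxio code: a bench-style runner with a global timeout.
public class BenchTimeoutSketch {

  /**
   * Starts numThreads workers that each need workMillis to finish, but waits at most
   * timeoutMillis for all of them. Returns how many workers were interrupted.
   */
  public static int runWithTimeout(int numThreads, long workMillis, long timeoutMillis)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    AtomicInteger interrupted = new AtomicInteger();
    for (int i = 0; i < numThreads; i++) {
      pool.submit(() -> {
        try {
          // Stand-in for the long-running test work (assumption for illustration).
          Thread.sleep(workMillis);
        } catch (InterruptedException e) {
          // This is the point where a real client call would fail with an
          // interruption-related error.
          interrupted.incrementAndGet();
          Thread.currentThread().interrupt();
        }
      });
    }
    pool.shutdown(); // accept no new tasks, let running ones continue
    if (!pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
      pool.shutdownNow(); // timeout expired: interrupt the workers still running
      pool.awaitTermination(5, TimeUnit.SECONDS);
    }
    return interrupted.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("timeout shorter than work: "
        + runWithTimeout(4, 2000, 100) + " workers interrupted");
    System.out.println("timeout longer than work: "
        + runWithTimeout(4, 100, 2000) + " workers interrupted");
  }
}
```

With a timeout shorter than the work, every worker gets interrupted; with a timeout longer than the work, none do. That matches the advice above: if the run legitimately needs more time than the timeout allows, raising the timeout avoids the interruption.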