Running the stress master bench CreateDir operation reports an exception


Alluxio Version: latest master

Describe the bug: Running the stress master bench fails and reports the following:

"Failed after 1 attempts: alluxio.exception.status.CancelledException: Thread interrupted", "Failed after 1 attempts: alluxio.exception.status.CancelledException: Thread interrupted"

The master log is as follows:

2022-03-25 16:55:21,557 INFO  SegmentedRaftLogWorker - alluxio-lucas-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /var/alluxio-share-dir/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_442711
2022-03-25 16:55:21,984 WARN  ForkJoinPoolHelper - Failed to compensate rpc pool. Consider increasing thread pool size.
java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker
        at alluxio.concurrent.jsr.ForkJoinPool.tryCompensate(ForkJoinPool.java:1320)
        at alluxio.concurrent.jsr.ForkJoinPool.managedBlock(ForkJoinPool.java:1002)
        at alluxio.concurrent.ForkJoinPoolHelper.safeManagedBlock(ForkJoinPoolHelper.java:41)
        at alluxio.master.journal.AsyncJournalWriter$FlushTicket.waitCompleted(AsyncJournalWriter.java:96)
        at alluxio.master.journal.AsyncJournalWriter.flush(AsyncJournalWriter.java:381)
        at alluxio.master.journal.MasterJournalContext.waitForJournalFlush(MasterJournalContext.java:78)
        at alluxio.master.journal.MasterJournalContext.close(MasterJournalContext.java:107)
        at alluxio.master.journal.StateChangeJournalContext.close(StateChangeJournalContext.java:53)
        at alluxio.master.file.RpcContext.closeQuietly(RpcContext.java:146)
        at alluxio.master.file.RpcContext.close(RpcContext.java:134)
        at alluxio.master.file.DefaultFileSystemMaster.createDirectory(DefaultFileSystemMaster.java:2242)
        at alluxio.master.file.FileSystemMasterClientServiceHandler.lambda$createDirectory$3(FileSystemMasterClientServiceHandler.java:171)
        at alluxio.RpcUtils.callAndReturn(RpcUtils.java:121)
        at alluxio.RpcUtils.call(RpcUtils.java:83)
        at alluxio.RpcUtils.call(RpcUtils.java:58)
        at alluxio.master.file.FileSystemMasterClientServiceHandler.createDirectory(FileSystemMasterClientServiceHandler.java:169)
        at alluxio.grpc.FileSystemMasterClientServiceGrpc$MethodHandlers.invoke(FileSystemMasterClientServiceGrpc.java:2368)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at alluxio.security.authentication.ClientIpAddressInjector$1.onHalfClose(ClientIpAddressInjector.java:57)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at alluxio.security.authentication.AuthenticatedUserInjector$1.onHalfClose(AuthenticatedUserInjector.java:67)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at alluxio.concurrent.jsr.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1378)
        at alluxio.concurrent.jsr.ForkJoinTask.doExec(ForkJoinTask.java:609)
        at alluxio.concurrent.jsr.ForkJoinPool.runWorker(ForkJoinPool.java:1356)
        at alluxio.concurrent.jsr.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:131)
2022-03-25 16:55:22,951 INFO  SegmentedRaftLogWorker - alluxio-lucas-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolling segment log-442711_442838 to index:442838
2022-03-25 16:55:22,951 INFO  SegmentedRaftLogWorker - alluxio-lucas-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolled log segment from /var/alluxio-share-dir/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_442711 to /var/alluxio-share-dir/journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_442711-442838

job_worker.log is as follows:

2022-03-25 15:05:20,849 WARN  meta.MetaMasterConfigClient (AbstractClient.java:retryRPC) - GetConfigHash() returned clusterConfigHash: "c723a97eaf6f3620a720daa99c3dfed8"
pathConfigHash: "d41d8cd98f00b204e9800998ecf8427e"
 in 12212 ms (>=10000 ms)
2022-03-25 15:16:19,908 INFO  task.TaskExecutorManager (TaskExecutorManager.java:notifyTaskCompletion) - Task 26 for job 1648191157925 completed.
2022-03-25 16:35:38,100 INFO  command.CommandHandlingExecutor (CommandHandlingExecutor.java:run) - Received run task 71 for job 1648191157926 on worker 1648191147994
2022-03-25 16:35:38,100 INFO  task.TaskExecutorManager (TaskExecutorManager.java:executeTask) - Task 71 for job 1648191157926 received
2022-03-25 16:35:38,101 INFO  task.TaskExecutorManager (TaskExecutorManager.java:notifyTaskRunning) - Task 71 for job 1648191157926 started
2022-03-25 16:55:48,506 INFO  task.TaskExecutorManager (TaskExecutorManager.java:notifyTaskCompletion) - Task 71 for job 1648191157926 completed.


Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
lucaspeng12138 commented, Apr 20, 2022

Thx @HelloHorizon @dbw9580. I found the problem. First, this test case runs for a long time, so --bench-timeout should be set to more than 200m. Second, the JVM options should give the young generation a large amount of memory; otherwise GC will kill the container.

jvmOptions:
  - "-server"
  - "-Xmx96g"
  - "-Xms96g"
  - "-Xmn10g"
  - "-XX:SurvivorRatio=6"
  - "-XX:ParallelGCThreads=40"
  - "-XX:MaxDirectMemorySize=4g"
  - "-XX:+UseG1GC"
  - "-XX:G1HeapRegionSize=32m"
  - "-XX:MaxGCPauseMillis=200"
  - "-XX:MetaspaceSize=2g"
  - "-XX:MaxMetaspaceSize=2g"
  - "-XX:+DisableExplicitGC"
  - "-XX:MaxTenuringThreshold=15"
  - "-Xloggc:/var/alluxio-share-dir/gc_%p.log.log"
  - "-verbose:gc"
  - "-XX:CMSInitiatingOccupancyFraction=70"
  - "-XX:+PrintGCDetails"
  - "-XX:ErrorFile=/var/alluxio-share-dir/java_error_%p.log"
  - "-XX:HeapDumpPath=/var/alluxio-share-dir/core_%p.log"
  - "-XX:+PrintClassHistogram"
  - "-XX:+PrintGCDetails"
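
When changing heap and GC options like the ones above, it can help to confirm which settings a JVM actually picked up. The following is a minimal diagnostic sketch, not part of Alluxio; the class name JvmSettingsCheck is hypothetical. It uses only the standard java.lang.management API and reports on the JVM it runs in, so it would need to be launched with the same options to be meaningful.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prints the heap limit, active collectors, and JVM
// arguments of the current process. It does not inspect a remote master.
final class JvmSettingsCheck {

  static List<String> describeJvm() {
    List<String> lines = new ArrayList<>();

    // Maximum heap the JVM will attempt to use (roughly reflects -Xmx).
    long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    lines.add("max heap (MiB): " + maxHeapMiB);

    // Names of the garbage collectors in use; they vary by JVM and flags.
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      lines.add("collector: " + gc.getName());
    }

    // Arguments passed to the JVM itself (not the program arguments).
    for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
      lines.add("jvm arg: " + arg);
    }
    return lines;
  }

  public static void main(String[] args) {
    describeJvm().forEach(System.out::println);
  }
}
```

Comparing the printed arguments and collector names against the intended jvmOptions is a quick way to spot an option that was dropped or overridden by the deployment.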

1 reaction
jiacheliu3 commented, Apr 6, 2022

@lucaspeng12138 Since you see that threads are interrupted, the cause is probably --bench-timeout, which defaults to 20min and then kills the test threads. Try setting it to a larger value: https://github.com/Alluxio/alluxio/blob/21560be53f4306ddc077a7be9af8e5c18f255a1c/stress/shell/src/main/java/alluxio/stress/cli/StressMasterBench.java#L292
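
The mechanism described here, a timeout that stops worker threads and surfaces as "Thread interrupted", can be illustrated with plain java.util.concurrent. This is a generic sketch under stated assumptions, not the StressMasterBench implementation: the class BenchTimeoutSketch, the method runWithTimeout, and the sleep standing in for the CreateDir workload are all hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a benchmark timeout interrupting its worker.
final class BenchTimeoutSketch {

  /** Runs a task that needs workMillis under a budget of timeoutMillis. */
  static String runWithTimeout(long workMillis, long timeoutMillis) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<String> result = pool.submit(() -> {
      try {
        Thread.sleep(workMillis); // stands in for a long series of RPCs
        return "completed";
      } catch (InterruptedException e) {
        // The worker was stopped before it finished its work.
        return "Thread interrupted";
      }
    });

    pool.shutdown(); // accept no new tasks, let the submitted one run
    if (!pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
      pool.shutdownNow(); // timeout reached: interrupt the running worker
    }
    return result.get(5, TimeUnit.SECONDS);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runWithTimeout(50, 2_000));   // budget is large enough
    System.out.println(runWithTimeout(60_000, 100)); // budget is too small
  }
}
```

The first call finishes within its budget and reports "completed"; the second exceeds it and reports "Thread interrupted", which is the same symptom as a benchmark whose timeout is shorter than its workload.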
