JVM crashes when starting the job_worker process
Alluxio Version: master branch, 2.3.0, 2.4.0
Describe the bug
I compiled Alluxio locally on CentOS 6 without errors. But when I try to start the job_worker process, I get the following error and the JVM crashes.
2020-11-12 15:56:04,999 INFO network.NettyUtils (NettyUtils.java:checkNettyEpollAvailable) - EPOLL_MODE is available
2020-11-12 15:56:05,517 INFO metrics.MetricsSystem (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2020-11-12 15:56:05,519 INFO metrics.MetricsHeartbeatContext (MetricsHeartbeatContext.java:addHeartbeat) - Created metrics heartbeat with ID app-8127555058117044977. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2020-11-12 15:56:05,547 INFO network.TieredIdentityFactory (TieredIdentityFactory.java:localIdentity) - Initialized tiered identity TieredIdentity(node=100.76.19.7, rack=presto-ss-qe-presto-test)
2020-11-12 15:56:05,596 INFO util.log (Log.java:initialized) - Logging initialized @1076ms to org.eclipse.jetty.util.log.Slf4jLog
2020-11-12 15:56:05,725 INFO alluxio.ProcessUtils (ProcessUtils.java:run) - Starting Alluxio job worker.
2020-11-12 15:56:05,725 INFO alluxio.ProcessUtils (ProcessUtils.java:run) - Running under Java 1.8.0_252
2020-11-12 15:56:05,726 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service starting @ /0.0.0.0:30003
2020-11-12 15:56:05,727 INFO metrics.MetricsHeartbeatContext (MetricsHeartbeatContext.java:addHeartbeat) - Created metrics heartbeat with ID app-4950460193034851762. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2020-11-12 15:56:05,730 INFO server.Server (Server.java:doStart) - jetty-9.4.31.v20200723; built: 2020-07-23T17:57:36.812Z; git: 450ba27947e13e66baa8cd1ce7e85a4461cacc1d; jvm 1.8.0_252-b4
2020-11-12 15:56:05,756 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@7cbd9d24{/metrics/json,null,AVAILABLE}
2020-11-12 15:56:05,757 WARN security.SecurityHandler (ConstraintSecurityHandler.java:checkPathsWithUncoveredHttpMethods) - ServletContext@o.e.j.s.ServletContextHandler@50dfbc58{/,null,STARTING} has uncovered http methods for path: /
2020-11-12 15:56:09,586 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@50dfbc58{/,null,AVAILABLE}
2020-11-12 15:56:09,594 INFO server.AbstractConnector (AbstractConnector.java:doStart) - Started ServerConnector@4470fbd6{HTTP/1.1, (http/1.1)}{0.0.0.0:30003}
2020-11-12 15:56:09,595 INFO server.Server (Server.java:doStart) - Started @5075ms
2020-11-12 15:56:09,595 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service started @ /0.0.0.0:30003
2020-11-12 15:56:09,653 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Started Alluxio job worker with id 1605167752223
2020-11-12 15:56:09,653 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Alluxio job worker version 2.5.0-SNAPSHOT started. bindHost=/0.0.0.0:30001, connectHost=tdw-100-76-19-7:30001, rpcPort=30001, webPort=30003
2020-11-12 15:56:09,653 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Starting gRPC server on address tdw-100-76-19-7:30001
2020-11-12 15:56:09,689 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Started gRPC server on address tdw-100-76-19-7:30001
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007ff6822993b8, pid=100388, tid=0x00007ff6001c1700
#
# JRE version: OpenJDK Runtime Environment (8.0_252-b04) (build 1.8.0_252-b4)
# Java VM: OpenJDK 64-Bit Server VM (25.252-b4 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [ld-linux-x86-64.so.2+0xb3b8] _dl_relocate_object+0x98
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data/tdwadmin/tdwenv/panyliu/alluxio-2.5-tq-0.1.0-SNAPSHOT/bin/hs_err_pid100388.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
Here is the stack info from the detailed crash report file:
Stack: [0x00007f23423e8000,0x00007f23424e9000], sp=0x00007f23424e4de0, free space=1011k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [ld-linux-x86-64.so.2+0xab12] _dl_relocate_object+0xa2
C [ld-linux-x86-64.so.2+0x1315f] dl_open_worker+0x38f
C [ld-linux-x86-64.so.2+0xe7b6] _dl_catch_error+0x66
C [libdl.so.2+0xf76] dlopen_doit+0x66

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+48
j java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+57
j java.lang.System.load(Ljava/lang/String;)V+7
j com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath()V+110
j com.sun.jna.Native.loadNativeDispatchLibrary()V+420
j com.sun.jna.Native.<clinit>()V+108
v ~StubRoutines::call_stub
j oshi.jna.platform.linux.LinuxLibc.<clinit>()V+4
v ~StubRoutines::call_stub
j oshi.hardware.platform.linux.LinuxCentralProcessor.getSystemLoadAverage(I)[D+24
j alluxio.worker.job.command.JobWorkerHealthReporter.compute()V+18
j alluxio.worker.job.command.CommandHandlingExecutor.heartbeat()V+4
j alluxio.heartbeat.HeartbeatThread.run()V+78
j java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
J 2294 C1 java.util.concurrent.FutureTask.run()V (126 bytes) @ 0x00007f244966ea64 [0x00007f244966e800+0x264]
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
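Reading the Java frames from the innermost frame outward shows which library asked for the native load. The following is a minimal sketch of that reading, assuming the "Java frames" section is available as text with one frame per line. HsErrFrames and its methods are hypothetical helper names for illustration; they are not part of Alluxio, OSHI, JNA, or the JDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reading the "Java frames" section of an hs_err file. */
public class HsErrFrames {

    /** Returns the method names of the interpreted ("j ") frames, innermost first. */
    public static List<String> interpretedFrames(String section) {
        List<String> methods = new ArrayList<>();
        for (String line : section.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("j ")) {
                continue; // skip the header, stub ("v") frames, and compiled ("J") frames
            }
            // Frame layout: "j pkg.Class.method(signature)returnType+bci"
            String body = trimmed.substring(2).trim();
            int paren = body.indexOf('(');
            methods.add(paren >= 0 ? body.substring(0, paren) : body);
        }
        return methods;
    }

    /** Returns the innermost frame whose class is outside java.* and sun.*, or null. */
    public static String firstNonJdkFrame(List<String> methods) {
        for (String method : methods) {
            if (!method.startsWith("java.") && !method.startsWith("sun.")) {
                return method;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A few frames copied from the crash report above.
        String sample = String.join("\n",
            "j java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0",
            "j java.lang.System.load(Ljava/lang/String;)V+7",
            "j com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath()V+110",
            "v ~StubRoutines::call_stub",
            "j oshi.jna.platform.linux.LinuxLibc.<clinit>()V+4");
        System.out.println(firstNonJdkFrame(interpretedFrames(sample)));
    }
}
```

Applied to the frames in this report, the first frame outside the JDK is in com.sun.jna.Native, reached from OSHI's LinuxCentralProcessor.getSystemLoadAverage, which is in turn called by Alluxio's JobWorkerHealthReporter.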
Here is the gdb debug info (I am not familiar with gdb):
(gdb) info shared
From To Syms Read Shared Object Library
No linux-vdso.so.1
0x00007f7ad75fb060 0x00007f7ad75fc4f8 Yes /lib64/libonion.so
0x00007f7ad70c6950 0x00007f7ad70d30f8 Yes /lib64/libpthread.so.0
0x00007f7ad6eac410 0x00007f7ad6eb9778 Yes /data/tdwenv/TencentKona-8.0.3-262/bin/…/lib/amd64/jli/libjli.so
0x00007f7ad6ca6e10 0x00007f7ad6ca78e8 Yes /lib64/libdl.so.2
0x00007f7ad6918580 0x00007f7ad6a49594 Yes /lib64/libc.so.6
0x00007f7ad72dfae0 0x00007f7ad72f8950 Yes /lib64/ld-linux-x86-64.so.2
0x00007f7ad5aec870 0x00007f7ad63e9058 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
0x00007f7ad55d4790 0x00007f7ad5641748 Yes /lib64/libm.so.6
0x00007f7ad53c92a0 0x00007f7ad53cc2d8 Yes /lib64/librt.so.1
0x00007f7ad51bb340 0x00007f7ad51c22b8 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libverify.so
0x00007f7ad4f9a5c0 0x00007f7ad4fadf78 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libjava.so
0x00007f7ad4d738a0 0x00007f7ad4d84898 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libzip.so
0x00007f7aa433fe30 0x00007f7aa43470e8 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libnio.so
0x00007f7aa4124bf0 0x00007f7aa4134098 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libnet.so
0x00007f7a98289a70 0x00007f7a9828c498 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libmanagement.so
No /tmp/libnetty_transport_native_epoll_x86_648487691960033771233.so
0x00007f7a69de8790 0x00007f7a69de8b98 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libjaas_unix.so
0x00007f7a68594840 0x00007f7a685b27b8 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libsunec.so
0x00007f7a68377910 0x00007f7a68387f18 Yes /lib64/libgcc_s-4.4.6-20110824.so.1
No /home/panyliu/.cache/JNA/temp/jna3956775499404260402.tmp
(*): Shared library is missing debugging information.
(gdb) bt
#0 0x00007f7ad692bb15 in raise (sig=6) at …/nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f7ad692cf25 in abort () at abort.c:89
#2 0x00007f7ad6211735 in os::abort(bool) () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#3 0x00007f7ad63b8ee3 in VMError::report_and_die() () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#4 0x00007f7ad6218242 in JVM_handle_linux_signal () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#5 0x00007f7ad620d4d3 in signalHandler(int, siginfo*, void*) () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#6 <signal handler called>
#7 _dl_relocate_object (scope=0x7f79b40231d8, reloc_mode=<value optimized out>, consider_profiling=0) at dl-reloc.c:238
#8 0x00007f7ad72f215f in dl_open_worker (a=<value optimized out>) at dl-open.c:416
#9 0x00007f7ad72ed7b6 in _dl_catch_error (objname=0x7f7a68cd4fd0, errstring=0x7f7a68cd4fc8, mallocedp=0x7f7a68cd4fdf, operate=0x7f7ad72f1dd0 <dl_open_worker>, args=0x7f7a68cd4f80) at dl-error.c:177
#10 0x00007f7ad72f191a in _dl_open (file=0x7f79b4022950 "/home/panyliu/.cache/JNA/temp/jna3956775499404260402.tmp", mode=-2147483647, caller_dlopen=0x7f7ad6214d1d, nsid=-2, argc=19, argv=<value optimized out>, env=0x7fffb87486e8) at dl-open.c:650
#11 0x00007f7ad6ca6f76 in dlopen_doit (a=0x7f7a68cd51a0) at dlopen.c:66
#12 0x00007f7ad72ed7b6 in _dl_catch_error (objname=0x7f79b40011d0, errstring=0x7f79b40011d8, mallocedp=0x7f79b40011c8, operate=0x7f7ad6ca6f10 <dlopen_doit>, args=0x7f7a68cd51a0) at dl-error.c:177
#13 0x00007f7ad6ca72ec in _dlerror_run (operate=0x7f7ad6ca6f10 <dlopen_doit>, args=0x7f7a68cd51a0) at dlerror.c:163
#14 0x00007f7ad6ca6ef1 in __dlopen (file=<value optimized out>, mode=<value optimized out>) at dlopen.c:87
#15 0x00007f7ad6214d1d in os::dll_load(char const*, char*, int) () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#16 0x00007f7ad600a173 in JVM_LoadLibrary () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#17 0x00007f7ad4f9b7b8 in Java_java_lang_ClassLoader_00024NativeLibrary_load () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libjava.so
#18 0x00007f7ac1018507 in ?? ()
#19 0x00000007099e9038 in ?? ()
#20 0x00007f7ac10080a1 in ?? ()
#21 0x00007f7a68cd5da8 in ?? ()
#22 0x00007f7ac10080a1 in ?? ()
#23 0x00007f7a68cd5d50 in ?? ()
#24 0x0000000000000000 in ?? ()
(gdb)
It seems related to loading a native library. My guess is that the JVM cannot find the .so file, or something else goes wrong during the load, so it receives a signal from the Linux kernel and shuts down. I know little about how Alluxio uses JNA, so I have not figured out the cause of the crash yet. Any suggestions are appreciated. By the way, this problem does not happen when using the community version, so it seems to be related to the compile environment, but I am not sure.
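One rough way to narrow this down is to read the same metric without JNA. In the Java frames, the crash is reached through OSHI's getSystemLoadAverage, which triggers JNA's native library load. The sketch below reads the load average through the JDK's own OperatingSystemMXBean instead, which does not go through JNA. If it runs cleanly on the affected host, that points at the JNA load path rather than at the JVM in general. This is a diagnostic idea only, not what Alluxio does, and LoadAverageProbe is a hypothetical class name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical diagnostic: reads the system load average without using JNA. */
public class LoadAverageProbe {

    /**
     * Returns the load average for the last minute as reported by the JDK.
     * The JDK documents a negative value when the load average is unavailable.
     */
    public static double jdkLoadAverage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage();
    }

    public static void main(String[] args) {
        // Print the runtime details so the result can be matched to the crash report.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("os.name      = " + System.getProperty("os.name"));
        System.out.println("os.version   = " + System.getProperty("os.version"));

        double load = jdkLoadAverage();
        if (load < 0) {
            System.out.println("Load average is not available on this platform.");
        } else {
            System.out.println("1-minute load average = " + load);
        }
    }
}
```

Running this with the same JDK that starts the job worker keeps the comparison meaningful, since the report shows the crash under a specific JDK 8 build.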
To Reproduce
Compile Alluxio locally on CentOS 6.
Run the following command:
./bin/alluxio-start.sh local
Expected behavior
The job_worker process starts successfully.
Urgency HIGH
Additional context
When I upgraded OSHI to a version above 5.3.1, the problem was solved.
Issue Analytics
- Created 3 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
@apc999 We use jdk8
It appears that @liupan664021 has figured out the issue and made a fix. The commit is good so I just merged it.