[BUG] Occasional segfaults during query parsing/execution with ucx environments
What happened:
Using this issue as a place to document segfaults a few folks were seeing when using dask-sql along with dask-cuda in an environment that contains ucx and ucx-py. Typically the errors are segfaults caused by SIGSEGV, and the import order of libraries seems to impact whether the segfault happens or not.
What you expected to happen: Ideally no segfaults.
Here are some of the reproducers I have:
Minimal Complete Verifiable Example:
from dask_sql import Context
import dask_cuda  # internally imports ucp if present
c = Context()
query = "SELECT cast(2002-01-01 as TIMESTAMP)"
print(c.explain(query)) # Fails most of the time (Non-deterministic)
Here’s an example that doesn’t use the dask-sql Python package or dask-cuda, but uses the DaskSQL jar and ucp (which is part of ucx-py):
import jpype
import jpype.imports
import os
# Assumes JAVA_HOME is set
jvmpath = jpype.getDefaultJVMPath()
jpype.startJVM(
    "-ea",
    "--illegal-access=deny",
    ignoreUnrecognized=True,
    convertStrings=False,
    jvmpath=jvmpath,
    classpath=["dask_sql/jar/DaskSQL.jar"],
)
print("started jvm")
import ucp
import com
print(dir(com)) # This step causes segfault
In both cases, moving the ucp/dask_cuda import before the JVM startup/dask-sql import fixes the issue.
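For reference, here is a minimal sketch of the working import order for the first reproducer (same query as above), based on the workaround just described:
# Workaround from above: import ucp/dask_cuda *before* dask-sql so that it is
# loaded before the JVM is started.
import dask_cuda  # internally imports ucp if present

from dask_sql import Context

c = Context()
query = "SELECT cast(2002-01-01 as TIMESTAMP)"
print(c.explain(query))  # no longer segfaults with this import order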
Here's the stacktrace from the reproducers above:
[dt03:46991:0:46991] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xc)
==== backtrace (tid: 46991) ====
0 /miniconda3/envs/segfault-env/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x115) [0x7feca79654e5]
1 /miniconda3/envs/segfault-env/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2a881) [0x7feca7965881]
2 /miniconda3/envs/segfault-env/lib/python3.8/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2aa52) [0x7feca7965a52]
3 [0x7ff03554d7f8]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007ff03554d7f8 (sent by kill), pid=46991, tid=46991
#
# JRE version: OpenJDK Runtime Environment (11.0.9.1) (build 11.0.9.1-internal+0-adhoc..src)
# Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 849 c2 java.util.HashMap.containsKey(Ljava/lang/Object;)Z java.base@11.0.9.1-internal (18 bytes) @ 0x00007ff03554d7f8 [0x00007ff03554d740+0x00000000000000b8]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to core.46991)
#
# An error report file with more information is saved as:
# hs_err_pid46991.log
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
Anything else we need to know?:
- ucp, when imported, intercepts errors thrown during execution, which explains why the crashes only happen with that import. (@pentschev probably knows more about this)
- Wasn’t able to reproduce the error without the DaskSQL jar.
- There’s a possibility this is expected behavior from the JVM based on the JPype user guide on errors. Specifically this:
As part of normal operations the JVM will trigger a segmentation fault when starting and when interrupting threads. Pythons fault handler can intercept these operations and interpret these as real faults.
Not sure if this implies that it’s normal execution incorrectly showing up as errors, or if there’s some underlying issue.
Environment:
- dask-sql version: from main @ 77f1d87ba7a3913d66afa8ceaf6b5e124bbf6644 (JPype version 1.3)
- Python version: 3.8
- Operating System: ubuntu 18.04
- Install method (conda, pip, source): from source, in an environment with dask-cuda preinstalled via:
conda install -c rapidsai-nightly -c nvidia -c conda-forge cudf dask-cudf dask-cuda python=3.7 cudatoolkit=11.2 openjdk maven ucx-py ucx-proc=*=gpu
Feel free to add more details or other reproducers that folks have seen running into the same issue. cc: @charlesbluca @jdye64 @pentschev @randerzander @VibhuJawa
Top GitHub Comments
@ayushdg with regard to the segfaults when the JVM starts: this should happen in the jpype.startJVM command, and it is caught by the JVM so the user never sees it. It is problematic because when you try to run with gdb to locate the source of a segfault, you will get this hidden segfault first, preventing you from finding the real source.
As far as interactions that are known to cause segfaults: the library LABLAS has an issue with the Java threading which will cause an internal failure if the threading portion of the library is started after the JVM is started. This is because the size of the page allocation changes when the JVM installs its operating hooks. Similarly, the Python signal handler routines and the JVM ones can have issues cooperating. Disabling the Python ones usually resolves this issue. But because the JVM ones are connected to the JVM diagnostics routines, you will get a Java dump even if the failure did not occur in Java.
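As an illustration of disabling the Python handlers around JVM startup, here is a minimal sketch using the standard-library faulthandler module (the assumption that faulthandler is the relevant Python handler here is mine, not the commenter's):
import faulthandler

import jpype

# Assumption: Python's fault handler is the handler that conflicts with the
# JVM's. If it is enabled, turn it off temporarily so it does not intercept
# the (expected) SIGSEGV the JVM triggers while starting up.
was_enabled = faulthandler.is_enabled()
if was_enabled:
    faulthandler.disable()

jpype.startJVM(classpath=["dask_sql/jar/DaskSQL.jar"])

# Restore the previous state once the JVM is up.
if was_enabled:
    faulthandler.enable()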
My recommendation is to run your minimal test case using the gdb line that I have in the JPype developers guide. It should skip the JVM start segfault and then stop at the problem spot. You can then get a backtrace to localize the problem. Please note that in Python, because of its reference counting, a failure can often crop up in an unrelated module. That is, the faulty module returns a resource which it is still holding on to and modifying, which is then recycled and reused by a different module. The second, innocent party then segfaults when it uses the resource. Locating the source of the fault can be a real bear.
Also make sure you are using the latest JPype, as 1.2 has a reference issue with arrays that can cause random crashes.
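For example, a quick way to check which JPype version is installed in the environment (a minimal sketch; JPype exposes a __version__ attribute):
import jpype

# 1.2.x is affected by the array reference issue mentioned above,
# so 1.3 or newer is preferred here.
print(jpype.__version__)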
Should we close this now as it no longer seems to be an issue?