[BUG] Segfaults on "select count(*) from test" with tables on top of cuDF DataFrames
test.py:
if __name__ == "__main__":
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster(protocol="tcp")
    client = Client(cluster)
    print(client)

    from dask_sql import Context
    import cudf

    c = Context()
    test_df = cudf.DataFrame({'id': [0, 1, 2]})
    c.create_table("test", test_df)

    # segfault
    print(c.sql("select count(*) from test").compute())
EDIT: Leaving the UCX snippet and trace below for historical purposes, but the issue seems entirely unrelated to UCX.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask_sql import Context
import pandas as pd
cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)
c = Context()
test_df = pd.DataFrame({'id': [0, 1, 2]})
c.create_table("test", test_df)
# segfault
c.sql("select count(*) from test")
trace:
/home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/distributed-2022.2.1+8.g39c5e885-py3.9.egg/distributed/comm/ucx.py:83: UserWarning: A CUDA context for device 0 already exists on process ID 1251168. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
warnings.warn(
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
...
[rl-dgx2-r13-u7-rapids-dgx201:1232380:0:1232380] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:1232380) ====
0 /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f921c5883f5]
1 /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f921c588791]
2 /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d962) [0x7f921c588962]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f976d27b0c0]
4 [0x7f93a78e6b58]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f93a78e6b58, pid=1232380, tid=1232380
#
# JRE version: OpenJDK Runtime Environment (11.0.1+13) (build 11.0.1+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM (11.0.1+13-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 1791 c2 java.util.Arrays.hashCode([Ljava/lang/Object;)I java.base@11.0.1 (56 bytes) @ 0x00007f93a78e6b58 [0x00007f93a78e6b20+0x0000000000000038]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/nfs/rgelhausen/notebooks/core.1232380)
#
# An error report file with more information is saved as:
# /home/nfs/rgelhausen/notebooks/hs_err_pid1232380.log
Compiled method (c2) 17616 1791 4 java.util.Arrays::hashCode (56 bytes)
total in heap [0x00007f93a78e6990,0x00007f93a78e6d80] = 1008
relocation [0x00007f93a78e6b08,0x00007f93a78e6b20] = 24
main code [0x00007f93a78e6b20,0x00007f93a78e6c60] = 320
stub code [0x00007f93a78e6c60,0x00007f93a78e6c78] = 24
metadata [0x00007f93a78e6c78,0x00007f93a78e6c80] = 8
scopes data [0x00007f93a78e6c80,0x00007f93a78e6ce8] = 104
scopes pcs [0x00007f93a78e6ce8,0x00007f93a78e6d48] = 96
dependencies [0x00007f93a78e6d48,0x00007f93a78e6d50] = 8
handler table [0x00007f93a78e6d50,0x00007f93a78e6d68] = 24
nul chk table [0x00007f93a78e6d68,0x00007f93a78e6d80] = 24
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
Top GitHub Comments
I was able to triage this to environments with cuml resolving to older dask versions. This issue only comes up when we use the newer nightlies of cudf/dask-cudf with an older version of dask (2021.11.2). Here is a minimal reproducer outside dask-sql:
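(The reproducer snippet itself wasn't captured here; the following is a hypothetical sketch of what a dask-cudf-only reproducer could look like, assuming a bare aggregation is enough to trigger the error described in the side note below.)

# Hypothetical sketch, not the original reproducer: a plain dask-cudf
# aggregation with no dask-sql involved.
import cudf
import dask_cudf

ddf = dask_cudf.from_cudf(cudf.DataFrame({'id': [0, 1, 2]}), npartitions=1)

# With newer cudf/dask-cudf nightlies on top of dask 2021.11.2, a call
# like this reportedly fails with a RecursionError rather than completing.
print(len(ddf))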
We should be good to close on this repo.
Side note: the reproducer outside dask-sql surfaces the actual error (Python recursion depth exceeded), but with the JVM spun up it gets captured as a segfault with no additional information.
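A quick way to check whether an environment has the problematic pairing (a hypothetical diagnostic, not from the original thread):

# Hypothetical diagnostic: print the installed versions to spot newer
# cudf/dask-cudf nightlies paired with an older dask such as 2021.11.2.
import dask
import distributed
import cudf
import dask_cudf

for name, mod in [("dask", dask), ("distributed", distributed),
                  ("cudf", cudf), ("dask_cudf", dask_cudf)]:
    print(f"{name}: {mod.__version__}")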
fyi, the second snippet above fails regardless of UCX. Updating the issue title to reflect that.
Also, I notice that the only conda environments in which these don't fail are ones in which I'm building dask-sql, dask, and distributed from source. I'm going to keep peeling back env customizations to narrow this down further.