[QST] RuntimeError - Something wrong with cuda runtime

Hi everyone, we are really impressed with the gains we have gotten from NVTabular, but we are starting to see this cryptic CUDA error when we try to “fit” a workflow inside a Docker container.

Here’s the error traceback and I would appreciate any pointers!

    workflow.fit(merged_dataset)
  File "/nvtabular/nvtabular/workflow.py", line 160, in fit
    results = [r.result() for r in self.client.compute(stats)]
  File "/nvtabular/nvtabular/workflow.py", line 160, in <listcomp>
    results = [r.result() for r in self.client.compute(stats)]
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 220, in result
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/nvtabular/nvtabular/ops/categorify.py", line 874, in _write_uniques
    df = type(df)(new_cols)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 301, in __init__
    self._init_from_dict_like(data, index=index, columns=columns)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 459, in _init_from_dict_like
    data, index = self._align_input_series_indices(data, index=index)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 530, in _align_input_series_indices
    aligned_input_series = cudf.core.series._align_indices(
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7203, in _align_indices
    result = [
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7204, in <listcomp>
    sr._align_to_index(
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 6366, in _align_to_index
    if self.index.equals(index):
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 1767, in equals
    return super().equals(other)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 231, in equals
    return super(Index, self).equals(other, check_types=check_types)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py", line 558, in equals
    if not self_col.equals(other_col, check_dtypes=check_types):
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 183, in equals
    null_equals = self._null_equals(other)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 187, in _null_equals
    return self.binary_operator("NULL_EQUALS", other)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 132, in binary_operator
    return _numeric_column_binop(
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 723, in _numeric_column_binop
    out = libcudf.binaryop.binaryop(lhs, rhs, op, out_dtype)
  File "cudf/_lib/binaryop.pyx", line 194, in cudf._lib.binaryop.binaryop
  File "cudf/_lib/binaryop.pyx", line 110, in cudf._lib.binaryop.binaryop_v_v
RuntimeError: Compilation failed: NVRTC_ERROR_COMPILATION
Compiler options: "-std=c++14 -D__CUDACC_RTC__ -default-device -arch=sm_70"
Header names:
  algorithm
  binaryop/jit/operation-udf.hpp
  binaryop/jit/operation.hpp
  binaryop/jit/traits.hpp
  cassert
  cfloat
  climits
  cmath
  cstddef
  cstdint
  ctime
  cuda/std/chrono
  cuda/std/climits
  cuda/std/cstddef
  cuda/std/limits
  cuda/std/type_traits
  cuda_runtime.h
  cudf/detail/utilities/assert.cuh
  cudf/fixed_point/fixed_point.hpp
  cudf/types.hpp
  cudf/utilities/bit.hpp
  cudf/wrappers/durations.hpp
  cudf/wrappers/timestamps.hpp
  detail/__config
  detail/__pragma_pop
  detail/__pragma_push
  detail/libcxx/include/chrono
  detail/libcxx/include/climits
  detail/libcxx/include/cstddef
  detail/libcxx/include/ctime
  detail/libcxx/include/limits
  detail/libcxx/include/ratio
  detail/libcxx/include/type_traits
  detail/libcxx/include/version
  iterator
  libcxx/include/__config
  libcxx/include/__pragma_pop
  libcxx/include/__pragma_push
  libcxx/include/__undef_macros
  limits
  ratio
  string
  type_traits
  version

detail/libcxx/include/limits(33): error: floating constant is out of range

detail/libcxx/include/limits(39): error: floating constant is out of range

2 errors detected in the compilation of "binaryop/jit/kernel.cu".
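
The useful part of the NVRTC output above is the short list of compiler diagnostics at the end, which is easy to lose under the long header list. As a minimal sketch, the snippet below pulls those diagnostics out of an error message. The function name `nvrtc_diagnostics` and the regular expression are illustrative assumptions based only on the two diagnostic lines in this traceback; they are not part of cuDF or NVRTC.

```python
import re

# Assumed diagnostic shape, taken from the traceback above:
#   <file>(<line>): error: <message>
_DIAG = re.compile(
    r"^(?P<file>\S+)\((?P<line>\d+)\): (?P<level>error|warning): (?P<msg>.+)$"
)


def nvrtc_diagnostics(message):
    """Return (file, line, level, msg) tuples found in an NVRTC error message."""
    found = []
    for raw in message.splitlines():
        match = _DIAG.match(raw.strip())
        if match:
            found.append(
                (match["file"], int(match["line"]), match["level"], match["msg"])
            )
    return found


# Tail of the error message from the traceback above.
sample = """RuntimeError: Compilation failed: NVRTC_ERROR_COMPILATION
detail/libcxx/include/limits(33): error: floating constant is out of range

detail/libcxx/include/limits(39): error: floating constant is out of range

2 errors detected in the compilation of "binaryop/jit/kernel.cu".
"""

for diag in nvrtc_diagnostics(sample):
    print(diag)
```

Here both diagnostics point at the bundled `limits` header rather than at anything in the workflow itself.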

For context, I am running the 0.5.2 NVIDIA Merlin image (nvcr.io/nvidia/merlin/merlin-pytorch-training:0.5.2) and git pulling the 0.6.0 commit (886d5b85fee83acfefc3f60c282f723f41719d53) in /nvtabular.

This job runs in a Docker container on EKS (AWS), with NVIDIA driver 460.73.01 and CUDA 11.2.

When I run printenv, I notice this environment variable, but I am not sure what to do about it:

_CUDA_COMPAT_STATUS=System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803.
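
To make that status easier to log or check at job startup, the value can be split into its parts. This is a minimal sketch that assumes the string always has the shape shown above (a message, a parenthesized `CUDA_...` error name, and a `cuInit()=<code>` suffix); the helper name `parse_compat_status` is hypothetical.

```python
import os
import re


def parse_compat_status(value):
    """Split a _CUDA_COMPAT_STATUS string into message, error name, and cuInit code.

    Returns None when the variable is unset or empty. The expected format is an
    assumption based on the single example in this post.
    """
    if not value:
        return None
    name = re.search(r"\((CUDA_[A-Z_]+)\)", value)
    code = re.search(r"cuInit\(\)=(\d+)", value)
    return {
        "message": value.split("(CUDA_")[0].strip(),
        "error": name.group(1) if name else None,
        "cuinit_code": int(code.group(1)) if code else None,
    }


status = parse_compat_status(os.environ.get("_CUDA_COMPAT_STATUS"))
print(status or "_CUDA_COMPAT_STATUS is not set")
```

Run inside the container, this prints the parsed fields, so a startup script could warn or exit early when a driver mismatch is reported instead of failing later inside a JIT-compiled kernel.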

Any ideas on what I may be missing here?

Top GitHub Comments

benfred commented, Aug 6, 2021

thanks! glad that you got this working.

Tracking the gevent changes in the containers here https://github.com/NVIDIA-Merlin/Merlin/issues/27

Arnie0426 commented, Aug 3, 2021

An update on this: I am not entirely sure what the root cause was, but I upgraded our AMI to amazon-eks-gpu-node-1.19-v20210722 and I haven’t seen this error since. I looked at the release notes of the last few AMIs and I don’t see any update to the underlying CUDA drivers, so I am not sure what changed.

For anyone curious, I confirmed on a barebones 202010504 AMI that the issue above can be replicated with:

import nvtabular as nvt
import cudf

# Small in-memory frame: one row per (author, productID) with a binary label
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A', 'User_A', 'User_A', 'User_A', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103, 104, 105, 106, 107],
    'label': [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1]
})
dataset = nvt.Dataset(df)

# First workflow: count rows per author
grouped = ["author", "productID"] >> nvt.ops.Groupby(groupby_cols=["author"], aggs="count")

workflow = nvt.Workflow(grouped)
non_spam_authors_dataset = workflow.fit_transform(dataset)

# Join the per-author counts back onto the original rows
merged_dataset = nvt.Dataset.merge(non_spam_authors_dataset, dataset, on="author", how="inner")

# Second workflow: categorify the merged columns
operations = ["author", "productID", "label"] >> nvt.ops.Categorify()
wf2 = nvt.Workflow(operations)

# Fitting on the merged dataset is the step that reproduces the error
wf2.fit(merged_dataset)
