[QST] RuntimeError - Something wrong with cuda runtime
Hi everyone, really impressed with a lot of the gains we have gotten with NVTabular, but we are starting to see this cryptic CUDA error when we try to “fit” a workflow in a Docker container.
Here’s the error traceback and I would appreciate any pointers!
workflow.fit(merged_dataset)
File "/nvtabular/nvtabular/workflow.py", line 160, in fit
results = [r.result() for r in self.client.compute(stats)]
File "/nvtabular/nvtabular/workflow.py", line 160, in <listcomp>
results = [r.result() for r in self.client.compute(stats)]
File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 220, in result
raise exc.with_traceback(tb)
File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/nvtabular/nvtabular/ops/categorify.py", line 874, in _write_uniques
df = type(df)(new_cols)
File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 301, in __init__
self._init_from_dict_like(data, index=index, columns=columns)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 459, in _init_from_dict_like
data, index = self._align_input_series_indices(data, index=index)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/dataframe.py", line 530, in _align_input_series_indices
aligned_input_series = cudf.core.series._align_indices(
File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7203, in _align_indices
result = [
File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 7204, in <listcomp>
sr._align_to_index(
File "/opt/conda/lib/python3.8/site-packages/cudf/core/series.py", line 6366, in _align_to_index
if self.index.equals(index):
File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 1767, in equals
return super().equals(other)
File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/index.py", line 231, in equals
return super(Index, self).equals(other, check_types=check_types)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/frame.py", line 558, in equals
if not self_col.equals(other_col, check_dtypes=check_types):
File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 183, in equals
null_equals = self._null_equals(other)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/column.py", line 187, in _null_equals
return self.binary_operator("NULL_EQUALS", other)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 132, in binary_operator
return _numeric_column_binop(
File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.8/site-packages/cudf/core/column/numerical.py", line 723, in _numeric_column_binop
out = libcudf.binaryop.binaryop(lhs, rhs, op, out_dtype)
File "cudf/_lib/binaryop.pyx", line 194, in cudf._lib.binaryop.binaryop
File "cudf/_lib/binaryop.pyx", line 110, in cudf._lib.binaryop.binaryop_v_v
RuntimeError: Compilation failed: NVRTC_ERROR_COMPILATION
Compiler options: "-std=c++14 -D__CUDACC_RTC__ -default-device -arch=sm_70"
Header names:
algorithm
binaryop/jit/operation-udf.hpp
binaryop/jit/operation.hpp
binaryop/jit/traits.hpp
cassert
cfloat
climits
cmath
cstddef
cstdint
ctime
cuda/std/chrono
cuda/std/climits
cuda/std/cstddef
cuda/std/limits
cuda/std/type_traits
cuda_runtime.h
cudf/detail/utilities/assert.cuh
cudf/fixed_point/fixed_point.hpp
cudf/types.hpp
cudf/utilities/bit.hpp
cudf/wrappers/durations.hpp
cudf/wrappers/timestamps.hpp
detail/__config
detail/__pragma_pop
detail/__pragma_push
detail/libcxx/include/chrono
detail/libcxx/include/climits
detail/libcxx/include/cstddef
detail/libcxx/include/ctime
detail/libcxx/include/limits
detail/libcxx/include/ratio
detail/libcxx/include/type_traits
detail/libcxx/include/version
iterator
libcxx/include/__config
libcxx/include/__pragma_pop
libcxx/include/__pragma_push
libcxx/include/__undef_macros
limits
ratio
string
type_traits
version
detail/libcxx/include/limits(33): error: floating constant is out of range
detail/libcxx/include/limits(39): error: floating constant is out of range
2 errors detected in the compilation of "binaryop/jit/kernel.cu".
For context: I am running the 0.5.2 NVIDIA Merlin image (nvcr.io/nvidia/merlin/merlin-pytorch-training:0.5.2) and git pulling the 0.6.0 commit (886d5b85fee83acfefc3f60c282f723f41719d53) in /nvtabular.
This job is being run in a Docker container on EKS (AWS), with NVIDIA driver 460.73.01 and CUDA 11.2.
When I run printenv, I do notice this environment variable, but I am not sure what to do about it:
_CUDA_COMPAT_STATUS=System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803
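For anyone wanting to check this variable programmatically inside the container, here is a minimal sketch. It assumes the value keeps the format quoted above (free text followed by cuInit()=<code>). The helper name parse_compat_status is hypothetical, and treating an unset variable as "no mismatch reported" is an assumption rather than documented behaviour.

```python
import os
import re


def parse_compat_status(value):
    """Parse a _CUDA_COMPAT_STATUS-style string into (ok, cuinit_code).

    Hypothetical helper: assumes the value contains "cuInit()=<int>",
    as in the message quoted above.
    """
    if not value:
        # Assumption: an unset or empty variable means no mismatch was reported.
        return True, None
    match = re.search(r"cuInit\(\)=(\d+)", value)
    if match is None:
        # Unknown format: fall back to looking for an error marker in the text.
        return "CUDA_ERROR" not in value, None
    code = int(match.group(1))
    # A cuInit() result of 0 is CUDA_SUCCESS; anything else is an error code.
    return code == 0, code


if __name__ == "__main__":
    ok, code = parse_compat_status(os.environ.get("_CUDA_COMPAT_STATUS", ""))
    if ok:
        print("No driver/compat mismatch reported")
    else:
        print(f"Driver/compat mismatch reported (cuInit() returned {code})")
```

With the value shown above, the helper would return (False, 803), matching the CUDA_ERROR_SYSTEM_DRIVER_MISMATCH label in the message itself.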
Any ideas on what I may be missing here?
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
thanks! glad that you got this working.
Tracking the gevent changes in the containers here https://github.com/NVIDIA-Merlin/Merlin/issues/27
So update on this: I am not entirely sure what the root cause was, but I managed to upgrade our AMI to amazon-eks-gpu-node-1.19-v20210722 and I haven’t seen this error since. I looked at the release notes of the last few AMIs and I don’t see any update to the underlying CUDA drivers, so I am not entirely sure what changed.
For anyone curious, I confirmed on a barebones 202010504 AMI that the issue above can be replicated with: