Spill-over after the libcudf++ merge is causing CUDA_ERROR_OUT_OF_MEMORY issues
After the libcudf++ merge, the spill-over mechanism appears to be failing. The current hypothesis: when dask-cuda spills data to disk and later moves it back to the GPU, it allocates device memory via numba instead of RMM, so the copy bypasses the RMM pool.
Relevant code lines:
From dask-cuda:
From distributed (an example of how it should be handled):
CC: @jakirkham @pentschev @kkraus14.
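As an illustration only, here is a minimal sketch of the suspected pattern (not the actual dask-cuda source; `host_frame` is a made-up stand-in). Un-spilling via `numba.cuda.to_device` allocates with a raw `cuMemAlloc` outside any RMM pool, while routing the copy through RMM would draw from the same pool the rest of the workload uses:

```python
# Minimal sketch of the suspected pattern; illustrative only, not the
# actual dask-cuda source.
import numpy as np
import rmm
from numba import cuda

# Stand-in for a frame that was spilled to host/disk.
host_frame = np.zeros(10_000_000, dtype="u1")

# What the trace below shows dask-cuda doing on un-spill:
# numba allocates with a raw cuMemAlloc, outside any RMM pool.
dev_frame = cuda.to_device(host_frame)

# Hypothesized fix direction: copy through RMM so the allocation
# comes from the pool the rest of cudf already uses.
dev_buf = rmm.DeviceBuffer.to_device(host_frame)
```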
Code to recreate the issue:
https://gist.github.com/VibhuJawa/dbf2573954db86fb193b687022a20f46
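The gist above is the authoritative reproducer. Purely as a hedged sketch (all sizes, column names, and limits below are invented), the workload shape is a merge large enough to force spilling under a low device_memory_limit:

```python
# Hypothetical sketch of the workload shape; the real reproducer is
# in the gist above. All sizes and names here are made up.
import cupy
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# A low device_memory_limit makes workers spill device buffers
# to host (and then disk) during the shuffle.
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="1GB")
client = Client(cluster)

n = 50_000_000
left = dask_cudf.from_cudf(
    cudf.DataFrame({"key": cupy.arange(n) % 1_000_000,
                    "x": cupy.random.rand(n)}),
    npartitions=16,
)
right = dask_cudf.from_cudf(
    cudf.DataFrame({"key": cupy.arange(1_000_000),
                    "y": cupy.random.rand(1_000_000)}),
    npartitions=4,
)

# Pulling spilled partitions back onto the GPU for the join is where
# the numba cuMemAlloc in the stack trace below fails.
out = left.merge(right, on="key").persist()
```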
Note: I have not re-run the cleaned-up code on exp01 (the machine was busy), but the issue should still be present.
Stack Trace
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
distributed.worker - ERROR - [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
allocator()
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
driver.cuMemAlloc(byref(ptr), bytesize)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py", line 2455, in execute
data[k] = self.data[k]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 152, in __getitem__
return self.device_buffer[key]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 70, in __getitem__
return self.slow_to_fast(key)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 57, in slow_to_fast
value = self.slow[key]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/func.py", line 39, in __getitem__
return self.load(self.d[key])
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in host_to_device
frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in <listcomp>
frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 225, in _require_cuda_context
return fn(*args, **kws)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/api.py", line 111, in to_device
to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 704, in auto_device
devobj = from_array_like(obj, stream=stream)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 642, in from_array_like
writeback=ary, stream=stream, gpu_data=gpu_data)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
gpu_data = devices.get_context().memalloc(self.alloc_size)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 761, in memalloc
self._attempt_allocation(allocator)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 751, in _attempt_allocation
allocator()
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
driver.cuMemAlloc(byref(ptr), bytesize)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb3dd96c410>>, <Task finished coro=<Worker.execute() done, defined at /raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py:2438> exception=CudaAPIError(2, 'Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY')>)
Traceback (most recent call last):
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
allocator()
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
driver.cuMemAlloc(byref(ptr), bytesize)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py", line 2455, in execute
data[k] = self.data[k]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 152, in __getitem__
return self.device_buffer[key]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 70, in __getitem__
return self.slow_to_fast(key)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 57, in slow_to_fast
value = self.slow[key]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/func.py", line 39, in __getitem__
return self.load(self.d[key])
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in host_to_device
frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in <listcomp>
frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 225, in _require_cuda_context
return fn(*args, **kws)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/api.py", line 111, in to_device
to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 704, in auto_device
devobj = from_array_like(obj, stream=stream)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 642, in from_array_like
writeback=ary, stream=stream, gpu_data=gpu_data)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
gpu_data = devices.get_context().memalloc(self.alloc_size)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 761, in memalloc
self._attempt_allocation(allocator)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 751, in _attempt_allocation
allocator()
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
driver.cuMemAlloc(byref(ptr), bytesize)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
self._check_error(fname, retcode)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Top GitHub Comments
Thanks a lot @VibhuJawa for testing this. I’ll make sure this is merged for 0.12 and will leave this issue open until we merge it there.
Yup, I believe so. I tested it on the same environment by doing a source install of dask-cuda (branch 277), i.e., it works with the setup below:

And fails with the setup below: