RawKernel does not work with dask.distributed
When I call a RawKernel on a remote worker via dask.distributed, e.g. ddf.map_partitions(kernel_wrapper), the worker crashes with a segfault. I suspect this is because of the following:
If I first execute the kernel on the client side, it compiles and runs successfully. After that, I submit the kernel for execution by the workers, at which point the NVRTC-compiled code gets serialized. Once it is sent to the other host, the CUDA context is different and the data structures of the compiled kernel are no longer valid. When I do not execute the kernel locally before submitting it to the workers (i.e., do not trigger compilation), everything runs fine.
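To illustrate the suspected path, here is a minimal sketch (assumptions: the script runs as __main__, so cloudpickle, which dask.distributed uses to ship tasks, pickles launch_kernel by value together with the module-global kernel it closes over; whether a RawKernel pickles at all depends on the CuPy version):

import cloudpickle
import cupy as cp

kernel = cp.RawKernel(r'''
extern "C" __global__ void test_kernel() { printf("Test\n"); }
''', 'test_kernel')

def launch_kernel():
    kernel((1,), (128,), ())

launch_kernel()                      # triggers NVRTC compilation in the client's CUDA context
cp.cuda.runtime.deviceSynchronize()  # flush device-side printf

# dask.distributed serializes the task function; since launch_kernel references
# the module-global kernel, the payload carries the kernel object along with
# whatever compiled state it holds at this point.
payload = cloudpickle.dumps(launch_kernel)
restored = cloudpickle.loads(payload)  # unpickled on another host, that state is stale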
- Conditions (you can just paste the output of python -c 'import cupy; cupy.show_config()'):
  - CuPy version: master, commit dd2fb09f63fce8b7ddb58deb4b91f0b3669d1a0d
  - OS/Platform: Linux ppc64le
  - CUDA version: 10.1.243
  - cuDNN/NCCL version: N/A
- Code to reproduce (a sketch; I did not actually try running the code below exactly as-is):
import dask.datasets
import cupy as cp

kernel = cp.RawKernel(r'''
extern "C" __global__ void test_kernel()
{
    printf("Test\n");
}
''', 'test_kernel')

def launch_kernel(df=None):
    kernel((1,), (128,), ())
    cp.cuda.runtime.deviceSynchronize()  # flush device-side printf
    return df

# Comment out the next line and the kernel should run fine on the remote workers.
launch_kernel()  # compiles the kernel locally; should print "Test"

from dask.distributed import Client
client = Client(scheduler_file='my-scheduler.json')
ddf = dask.datasets.timeseries()
ddf.map_partitions(launch_kernel).compute()  # segfault on the workers
- Error messages, stack traces, or logs
distributed.nanny - INFO - Worker process 56138 was killed by signal 11
distributed.nanny - WARNING - Restarting worker
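For what it's worth, one way to sidestep this (a sketch under the assumption that compiling once per worker process is acceptable; launch_kernel_on_worker is a hypothetical name) is to construct the RawKernel inside the function that map_partitions executes, so each worker compiles it in its own CUDA context and no client-side compiled state is ever serialized:

import cupy as cp

def launch_kernel_on_worker(df):
    # Built here, the kernel is compiled lazily inside the worker process,
    # in that worker's own CUDA context. CuPy caches compiled code by source,
    # so repeated calls on the same worker should not recompile.
    k = cp.RawKernel(r'''
    extern "C" __global__ void test_kernel()
    {
        printf("Test\n");
    }
    ''', 'test_kernel')
    k((1,), (128,), ())
    return df

ddf.map_partitions(launch_kernel_on_worker).compute()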
Top GitHub Comments
That is exactly my plan 👍
Thanks all! 😄