
RawKernel does not work with dask.distributed


When I call a RawKernel on a remote worker using distributed, e.g. ddf.map_partitions(kernel_wrapper), the worker crashes with a segfault. I suspect this is because of the following:

If I first execute the kernel on the client side, it compiles and runs successfully. When I then submit the kernel for execution by the workers, the NVRTC-compiled code gets serialized. Once it arrives on the other host, the CUDA context is different and the data structures of the compiled kernel are no longer valid. If I do not execute the kernel locally before submitting it to the workers (i.e. do not trigger compilation), everything runs fine.
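
For context, cp.RawKernel compiles lazily: constructing the object does not invoke NVRTC; the first launch does. A minimal sketch of that distinction (my illustration, not code from the issue):

import cupy as cp

k = cp.RawKernel(r'''
extern "C" __global__ void noop() {}
''', 'noop')

# No NVRTC call has happened yet; the object only holds the source string.
k((1,), (1,), ())  # the first launch compiles for the current device/context

This is presumably why skipping the local launch avoids the crash: an uncompiled RawKernel carries little more than its source, while a compiled one drags context-bound state along when serialized.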

  • Conditions (you can just paste the output of python -c 'import cupy; cupy.show_config()')
    • CuPy version: master, commit:dd2fb09f63fce8b7ddb58deb4b91f0b3669d1a0d
    • OS/Platform: Linux ppc64le
    • CUDA version: 10.1.243
    • cuDNN/NCCL version: N/A
  • Code to reproduce

(A sketch; I did not actually try to run the code below exactly as is.)

import dask.datasets
import cupy as cp

kernel = cp.RawKernel(r'''
extern "C" __global__ void test_kernel()
{
    printf("Test\n");
}
''', 'test_kernel')

def launch_kernel(df):
    # One block of 128 threads, no kernel arguments.
    kernel((1,), (128,), ())
    return df

# Comment out the line below and it should run on the remote workers;
# running it here first triggers NVRTC compilation on the client.
launch_kernel(None)  # should print "Test"

from dask.distributed import Client
client = Client(scheduler_file='my-scheduler.json')
ddf = dask.datasets.timeseries()
# Pass meta explicitly so dask does not call launch_kernel on the client
# while inferring output metadata.
ddf.map_partitions(launch_kernel, meta=ddf._meta).compute()  # segfault on the workers
  • Error messages, stack traces, or logs
distributed.nanny - INFO - Worker process 56138 was killed by signal 11
distributed.nanny - WARNING - Restarting worker
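
Assuming the diagnosis above is right, one workaround (my sketch, not something from this thread) is to defer RawKernel construction to the process that launches it, so NVRTC compilation happens inside each worker's own CUDA context:

import cupy as cp

_SRC = r'''
extern "C" __global__ void test_kernel()
{
    printf("Test\n");
}
'''
_kernel_cache = {}

def _get_kernel():
    # Compile on first use in this process; later calls reuse the cached
    # kernel, so each worker compiles exactly once.
    if 'test_kernel' not in _kernel_cache:
        _kernel_cache['test_kernel'] = cp.RawKernel(_SRC, 'test_kernel')
    return _kernel_cache['test_kernel']

def launch_kernel(df):
    _get_kernel()((1,), (128,), ())
    return df

With this shape, ddf.map_partitions(launch_kernel, meta=ddf._meta).compute() launches the kernel on the workers without shipping any compiled state from the client.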

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

3 reactions
leofang commented, Oct 1, 2020

That is exactly my plan 👍

2 reactions
jakirkham commented, Oct 12, 2020

Thanks all! 😄
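
For what it's worth, leofang's "plan" above appears to be pickling support for RawKernel/RawModule. My understanding, worth verifying against the CuPy changelog, is that later CuPy releases serialize only the source and compile options and recompile on unpickle, which would make the original snippet work as written:

import pickle
import cupy as cp

k = cp.RawKernel(r'''
extern "C" __global__ void test_kernel() { printf("Test\n"); }
''', 'test_kernel')

# Assumption: on CuPy versions where RawKernel supports pickling, the
# round trip carries source + options; the receiving process recompiles
# in its own CUDA context on first launch.
k2 = pickle.loads(pickle.dumps(k))
k2((1,), (1,), ())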
