Experiment rechunking cupy array on DGX
Using the DGX branch and the tom-ucx branch of distributed, I'm playing with rechunking a large 2-D array from row-wise to column-wise chunks:
```python
from dask_cuda import DGX
from dask.distributed import Client

# A worker for each of GPUs 0-3 on the DGX
cluster = DGX(CUDA_VISIBLE_DEVICES=[0, 1, 2, 3])
client = Client(cluster)

import cupy
import dask.array as da

# CuPy-backed random state so the chunks live on the GPUs
rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# Persist a 40000 x 40000 array in ~1 GiB chunks, then rechunk it the other way
x = rs.random((40000, 40000), chunks=(None, "1 GiB")).persist()
y = x.rechunk(("1 GiB", -1)).persist()
```
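As a quick sanity check on what that rechunk does to the block layout, here is a small CPU-only sketch (plain NumPy-backed dask arrays, no GPUs or cluster needed; the shape and chunk specs are copied from the snippet above, and everything stays lazy so nothing is computed):

```python
import dask.array as da

# Same shape and chunk specs as above, but NumPy-backed and never computed
x = da.random.random((40000, 40000), chunks=(None, "1 GiB"))
print(x.chunksize)   # each block spans all 40000 rows and a ~1 GiB slice of columns

y = x.rechunk(("1 GiB", -1))
print(y.chunksize)   # each block spans a ~1 GiB slice of rows and all 40000 columns
```

Every output block needs a piece of every input block, so the rechunk is effectively an all-to-all shuffle between workers, which is why it exercises the communication layer.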
This is a fun experiment because it’s a common operation, stresses UCX a bit, and is currently quite fast (when it works).
I’ve run into the following problems:
- Spilling to disk when I run out of device memory (I don't have any spill-to-disk mechanism enabled at the moment); a sketch of enabling device-to-host spilling follows this list.
- Sometimes I get this error from the Dask UCX comm code (an illustrative sketch of the framing it is reading follows this list):

  ```
    File "/home/nfs/mrocklin/distributed/distributed/comm/ucx.py", line 134, in read
      nframes, = struct.unpack("Q", obj[:8])  # first eight bytes for number of frames
  ```
- Sometimes CURAND seems to dislike me:

  ```
  distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x10cupy.cuda.curand\x94\x8c\x0bCURANDError\x94\x93\x94\x8c!CURAND_STATUS_PREEXISTING_FAILURE\x94\x85\x94R\x94}\x94\x8c\x06status\x94K\xcasb.'
  Traceback (most recent call last):
    File "/home/nfs/mrocklin/distributed/distributed/worker.py", line 3193, in apply_function
      result = function(*args, **kwargs)
    File "/home/nfs/mrocklin/dask/dask/array/random.py", line 411, in _apply_random
      return func(*args, size=size, **kwargs)
    File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 516, in random_sample
      out = self._random_sample_raw(size, dtype)
    File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 505, in _random_sample_raw
      func(self._generator, out.data.ptr, out.size)
    File "cupy/cuda/curand.pyx", line 155, in cupy.cuda.curand.generateUniformDouble
    File "cupy/cuda/curand.pyx", line 159, in cupy.cuda.curand.generateUniformDouble
    File "cupy/cuda/curand.pyx", line 83, in cupy.cuda.curand.check_status
  cupy.cuda.curand.CURANDError: CURAND_STATUS_PREEXISTING_FAILURE
  ```
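On the first item: the `DGX(...)` helper used above may not expose it, but `dask_cuda.LocalCUDACluster` accepts a `device_memory_limit` that triggers device-to-host spilling. Something along these lines is one way to keep the experiment from running out of GPU memory; this is a sketch under that assumption, not tested on the branches used here, and the 30 GB threshold is arbitrary:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per listed GPU; device buffers spill to host memory once a
# worker's GPU usage crosses device_memory_limit.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=[0, 1, 2, 3],
    device_memory_limit="30GB",   # arbitrary threshold, below the GPU's capacity
)
client = Client(cluster)
```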
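On the second item, the failing line is reading a length-prefixed header off the UCX comm: the first eight bytes of a message are expected to hold an unsigned 64-bit frame count. Purely to illustrate that pattern (this is not distributed's code; `pack_frames` and `read_nframes` are made-up names):

```python
import struct

# Hypothetical illustration of length-prefixed framing -- not distributed's
# actual implementation.

def pack_frames(frames):
    """Prefix the payload with the frame count, then each frame's size (uint64)."""
    header = struct.pack("Q", len(frames))
    sizes = struct.pack("%dQ" % len(frames), *(len(f) for f in frames))
    return header + sizes + b"".join(frames)

def read_nframes(msg):
    """The step the traceback points at: the first eight bytes hold nframes."""
    if len(msg) < 8:
        # Guard against truncated messages, which would otherwise make
        # struct.unpack raise at this point.
        raise ValueError("expected at least 8 bytes, got %d" % len(msg))
    (nframes,) = struct.unpack("Q", msg[:8])
    return nframes
```

Whatever the underlying cause of the error here, the unpack at ucx.py line 134 assumes at least eight bytes have arrived, so a short or empty buffer would fail at exactly that spot.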
I don't plan to investigate these personally at the moment, but I wanted to record the experiment somewhere (and this seems to be the best place for now). I think it might be useful to have someone like @madsbk or @pentschev look into this after the UCX and DGX work gets cleaned up a bit more.
Top GitHub Comments
Not urgent. I recommend waiting until tomorrow at least.
There has been great progress on that over the last year or so; I'm closing this, as I don't think it is an issue anymore.