
Experiment rechunking cupy array on DGX

See original GitHub issue

Using the DGX branch and the tom-ucx distributed branch, I’m playing with rechunking a large 2-D array from by-row to by-column chunks:

from dask_cuda import DGX                    # experimental DGX cluster class
cluster = DGX(CUDA_VISIBLE_DEVICES=[0, 1, 2, 3])
from dask.distributed import Client
client = Client(cluster)

import cupy, dask.array as da, numpy as np

# Back dask.array's RandomState with CuPy's so blocks are generated on the GPUs
rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# ~12.8 GB of float64: whole rows per chunk, columns split into ~1 GiB blocks
x = rs.random((40000, 40000), chunks=(None, '1 GiB')).persist()
# Flip the layout: rows split into ~1 GiB blocks, whole columns per chunk
y = x.rechunk(('1 GiB', -1)).persist()
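
As a quick sanity check (not part of the original issue), printing the .chunks tuples before and after the rechunk shows the layout flipping; the block counts and sizes in the comments are only indicative.

# Hypothetical follow-up, reusing x and y from the snippet above
print(x.chunks)  # ((40000,), (n1, n2, ...)) -- whole rows, ~1 GiB column blocks
print(y.chunks)  # ((m1, m2, ...), (40000,)) -- ~1 GiB row blocks, whole columns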

This is a fun experiment because it’s a common operation, stresses UCX a bit, and is currently quite fast (when it works).

I’ve run into the following problems:

  1. Spilling to disk when I run out of device memory (I don’t have any spill-to-disk mechanisms enabled at the moment; a hedged sketch of configuring device-memory spilling follows the tracebacks below)

  2. Sometimes I get this error from the Dask UCX comm code (distributed/comm/ucx.py):

      File "/home/nfs/mrocklin/distributed/distributed/comm/ucx.py", line 134, in read
        nframes, = struct.unpack("Q", obj[:8])  # first eight bytes for number of frames
    
  3. Sometimes CURAND seems to dislike me:

distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x10cupy.cuda.curand\x94\x8c\x0bCURANDError\x94\x93\x94\x8c!CURAND_STATUS_PREEXISTING_FAILURE\x94\x85\x94R\x94}\x94\x8c\x06status\x94K\xcasb.'
Traceback (most recent call last):
  File "/home/nfs/mrocklin/distributed/distributed/worker.py", line 3193, in apply_function
    result = function(*args, **kwargs)
  File "/home/nfs/mrocklin/dask/dask/array/random.py", line 411, in _apply_random
    return func(*args, size=size, **kwargs)
  File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 516, in random_sample
    out = self._random_sample_raw(size, dtype)
  File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 505, in _random_sample_raw
    func(self._generator, out.data.ptr, out.size)
  File "cupy/cuda/curand.pyx", line 155, in cupy.cuda.curand.generateUniformDouble
  File "cupy/cuda/curand.pyx", line 159, in cupy.cuda.curand.generateUniformDouble
  File "cupy/cuda/curand.pyx", line 83, in cupy.cuda.curand.check_status
cupy.cuda.curand.CURANDError: CURAND_STATUS_PREEXISTING_FAILURE
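
Related to problem 1: below is a minimal sketch of how device-memory spilling could be configured, assuming today’s dask_cuda.LocalCUDACluster rather than the experimental DGX class used above; the device_memory_limit value is illustrative, not a recommendation.

# Sketch (assumption: current dask-cuda API, not the DGX/tom-ucx branches above).
# device_memory_limit sets the per-worker threshold at which CuPy blocks are
# spilled from GPU memory to host memory, which is meant to avoid device
# out-of-memory failures during operations like the rechunk above.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=[0, 1, 2, 3],
    device_memory_limit="24GB",   # illustrative per-GPU threshold
)
client = Client(cluster)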

I don’t plan to investigate these personally at the moment, but I wanted to record the experiment somewhere (and this seems to currently be the best place?). I think that it might be useful to have someone like @madsbk or @pentschev look into this after the UCX and DGX work gets cleaned up a bit more.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented on May 29, 2019

Not urgent. I recommend waiting until tomorrow at least.

On Wed, May 29, 2019 at 2:10 PM Peter Andreas Entschev <notifications@github.com> wrote:

I will definitely dive into that, since I have a strong feeling that the memory spilling mechanism may not be working properly, or not active at all. How urgent is this for both of you?


0 reactions
pentschev commented on Jan 8, 2021

There has been great progress on that over the last year or so; I’m closing this as I don’t think it is an issue anymore.
