Experiment rechunking cupy array on DGX
Using the DGX branch and the tom-ucx branch of distributed, I'm playing with rechunking a large 2-D array from row-wise to column-wise chunks:
```python
from dask_cuda import DGX
from dask.distributed import Client

# A worker for each of GPUs 0-3 on the DGX
cluster = DGX(CUDA_VISIBLE_DEVICES=[0, 1, 2, 3])
client = Client(cluster)

import cupy
import dask.array as da

# CuPy-backed random state so the chunks live on the GPUs
rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# Persist a 40000 x 40000 array in ~1 GiB chunks, then rechunk it the other way
x = rs.random((40000, 40000), chunks=(None, "1 GiB")).persist()
y = x.rechunk(("1 GiB", -1)).persist()
```
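As a quick sanity check on what that rechunk does to the block layout, here is a small CPU-only sketch (plain NumPy-backed dask arrays, no GPUs or cluster needed; the shape and chunk specs are copied from the snippet above, and everything stays lazy so nothing is computed):

```python
import dask.array as da

# Same shape and chunk specs as above, but NumPy-backed and never computed
x = da.random.random((40000, 40000), chunks=(None, "1 GiB"))
print(x.chunksize)   # each block spans all 40000 rows and a ~1 GiB slice of columns

y = x.rechunk(("1 GiB", -1))
print(y.chunksize)   # each block spans a ~1 GiB slice of rows and all 40000 columns
```

Every output block needs a piece of every input block, so the rechunk is effectively an all-to-all shuffle between workers, which is why it exercises the communication layer.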
This is a fun experiment because it’s a common operation, stresses UCX a bit, and is currently quite fast (when it works).
I’ve run into the following problems:
- Spilling to disk when I run out of device memory (I don't have any spill-to-disk mechanism enabled at the moment); a sketch of enabling device-to-host spilling follows this list.
- Sometimes I get this error from the Dask UCX comm code (an illustrative sketch of the framing it is reading follows this list):

  ```
    File "/home/nfs/mrocklin/distributed/distributed/comm/ucx.py", line 134, in read
      nframes, = struct.unpack("Q", obj[:8])  # first eight bytes for number of frames
  ```
- Sometimes CURAND seems to dislike me:

  ```
  distributed.protocol.pickle - INFO - Failed to deserialize b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x10cupy.cuda.curand\x94\x8c\x0bCURANDError\x94\x93\x94\x8c!CURAND_STATUS_PREEXISTING_FAILURE\x94\x85\x94R\x94}\x94\x8c\x06status\x94K\xcasb.'
  Traceback (most recent call last):
    File "/home/nfs/mrocklin/distributed/distributed/worker.py", line 3193, in apply_function
      result = function(*args, **kwargs)
    File "/home/nfs/mrocklin/dask/dask/array/random.py", line 411, in _apply_random
      return func(*args, size=size, **kwargs)
    File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 516, in random_sample
      out = self._random_sample_raw(size, dtype)
    File "/raid/mrocklin/miniconda/envs/ucx/lib/python3.7/site-packages/cupy/random/generator.py", line 505, in _random_sample_raw
      func(self._generator, out.data.ptr, out.size)
    File "cupy/cuda/curand.pyx", line 155, in cupy.cuda.curand.generateUniformDouble
    File "cupy/cuda/curand.pyx", line 159, in cupy.cuda.curand.generateUniformDouble
    File "cupy/cuda/curand.pyx", line 83, in cupy.cuda.curand.check_status
  cupy.cuda.curand.CURANDError: CURAND_STATUS_PREEXISTING_FAILURE
  ```
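On the first item: the `DGX(...)` helper used above may not expose it, but `dask_cuda.LocalCUDACluster` accepts a `device_memory_limit` that triggers device-to-host spilling. Something along these lines is one way to keep the experiment from running out of GPU memory; this is a sketch under that assumption, not tested on the branches used here, and the 30 GB threshold is arbitrary:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per listed GPU; device buffers spill to host memory once a
# worker's GPU usage crosses device_memory_limit.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=[0, 1, 2, 3],
    device_memory_limit="30GB",   # arbitrary threshold, below the GPU's capacity
)
client = Client(cluster)
```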
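On the second item, the failing line is reading a length-prefixed header off the UCX comm: the first eight bytes of a message are expected to hold an unsigned 64-bit frame count. Purely to illustrate that pattern (this is not distributed's code; `pack_frames` and `read_nframes` are made-up names):

```python
import struct

# Hypothetical illustration of length-prefixed framing -- not distributed's
# actual implementation.

def pack_frames(frames):
    """Prefix the payload with the frame count, then each frame's size (uint64)."""
    header = struct.pack("Q", len(frames))
    sizes = struct.pack("%dQ" % len(frames), *(len(f) for f in frames))
    return header + sizes + b"".join(frames)

def read_nframes(msg):
    """The step the traceback points at: the first eight bytes hold nframes."""
    if len(msg) < 8:
        # Guard against truncated messages, which would otherwise make
        # struct.unpack raise at this point.
        raise ValueError("expected at least 8 bytes, got %d" % len(msg))
    (nframes,) = struct.unpack("Q", msg[:8])
    return nframes
```

Whatever the underlying cause of the error here, the unpack at ucx.py line 134 assumes at least eight bytes have arrived, so a short or empty buffer would fail at exactly that spot.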
I don't plan to investigate these personally at the moment, but I wanted to record the experiment somewhere (and this seems to be the best place for now). I think it might be useful to have someone like @madsbk or @pentschev look into this after the UCX and DGX work gets cleaned up a bit more.
Top GitHub Comments
Not urgent. I recommend waiting until tomorrow at least.
There has been great progress on that over the last year or so; I'm closing this, as I don't think it is an issue anymore.