SLURMCluster worker does not find available tensorflow GPU
Description
Dask workers running on nodes with available GPUs only run TensorFlow on the CPU. Running the same submission script outside of Dask uses the GPU.
Expected behavior
Dask workers would see available GPU devices when importing tensorflow
Versions
>>> distributed.__version__
'2.9.1'
>>> tf.__version__
'1.14.0'
>>> dask_jobqueue.__version__
'0.7.0'
MWE
Consider the following reproducible-ish example (as much as is possible on a SLURMCluster):
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# job args
extra_args = [
    "--error=/home/b.weinstein/logs/dask-worker-%j.err",
    "--account=ewhite",
    "--output=/home/b.weinstein/logs/dask-worker-%j.out",
    "--partition=gpu",
    "--gpus=1"
]

cluster = SLURMCluster(
    processes=1,
    cores=1,
    memory="20GB",
    walltime='24:00:00',
    job_extra=extra_args,
    local_directory="/orange/ewhite/b.weinstein/NEON/logs/dask/",
    death_timeout=300)
print(cluster.job_script())
cluster.scale(2)
client = Client(cluster)
# available
def available():
    import tensorflow as tf
    return tf.test.is_gpu_available()

# list devices
def devices():
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()

# submit
future = client.submit(devices)
print(future.result())

# submit
future = client.submit(available)
print(future.result())
returns
>>> print(future.result())
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3139368017992869205
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 9545327502143081909
physical_device_desc: "device: XLA_CPU device"
]
>>>
... #submit
... future = client.submit(available)
>>> print(future.result())
False
CPU only.
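One way to narrow this down (not something tried in the original report, just a hedged diagnostic sketch reusing the client from the MWE above) is to ask the worker processes directly what GPU-related environment they see:

import os
import socket

# Runs on every worker; returns one dict per worker, keyed by worker address.
def worker_env():
    return {
        "host": socket.gethostname(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }

print(client.run(worker_env))

If CUDA_VISIBLE_DEVICES is empty or LD_LIBRARY_PATH is missing the CUDA libraries on the workers, that would explain TensorFlow falling back to CPU there.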
To prove the GPU is there, now consider the job script printed by the client:
>>> print(cluster.job_script())
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=19G
#SBATCH -t 24:00:00
#SBATCH --error=/home/b.weinstein/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/home/b.weinstein/logs/dask-worker-%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=1
JOB_ID=${SLURM_JOB_ID%;*}
/apps/tensorflow/1.14.0py3/bin/python -m distributed.cli.dask_worker tcp://10.13.55.18:44006 --nthreads 1 --memory-limit 20.00GB --name name --nanny --death-timeout 300 --local-directory /orange/ewhite/b.weinstein/NEON/logs/dask/
Take that SLURM script and run it alone
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=19G
#SBATCH -t 24:00:00
#SBATCH --error=/home/b.weinstein/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/home/b.weinstein/logs/dask-worker-%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=1
and then run the test
# available
def available():
    import tensorflow as tf
    return tf.test.is_gpu_available()

# list devices
def devices():
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()

print(available())
print(devices())
yields the expected behavior
[b.weinstein@login3 NEON_crown_maps]$ cat ~/logs/dask-worker-46459347.out
True
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16358656244715553424
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14366244434418621112
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 10812866560
locality {
bus_id: 1
links {
}
}
incarnation: 11711177448480902725
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1a:00.0, compute capability: 7.5"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 13429929682801714225
physical_device_desc: "device: XLA_GPU device"
]
GPU is available.
My interpretation is that something about the scheduler (because it's on CPU?) is not invoking TensorFlow as expected. @TomAugspurger, is this related to why dask-tensorflow was split off? Is there something different here?
From https://docs.dask.org/en/latest/gpu.html:
Dask doesn’t need to know that these functions use GPUs. It just runs Python functions. Whether or not those Python functions use a GPU is orthogonal to Dask. It will work regardless.
I was expecting this to be a common use case. Thanks for any feedback!
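One more hedged diagnostic sketch, not something attempted in this report: dask-jobqueue's env_extra parameter injects shell lines into the generated job script before dask-worker starts, so GPU visibility inside the worker job itself gets written to the worker logs. A minimal variant of the cluster above (reusing the same extra_args list and paths):

from dask_jobqueue import SLURMCluster

# Same cluster as in the MWE, plus env_extra lines that run inside the SLURM
# job before dask-worker starts, logging what the allocation can actually see.
cluster = SLURMCluster(
    processes=1,
    cores=1,
    memory="20GB",
    walltime='24:00:00',
    job_extra=extra_args,  # the same extra_args list as in the MWE
    env_extra=[
        "nvidia-smi || true",                                # does the allocation see a GPU?
        "echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES",   # what did SLURM expose?
    ],
    local_directory="/orange/ewhite/b.weinstein/NEON/logs/dask/",
    death_timeout=300,
)
print(cluster.job_script())  # the env_extra lines appear just before the worker command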
Top GitHub Comments
This is something I am wondering about as well: on my SLURM cluster, I can only get GPU nodes through SLURM jobs, so if I want my Dask scheduler in a SLURM job, I also want to use the GPUs on that node. Note that for now, my Dask scheduler lives on the login node. Since the Dask scheduler does not consume many resources, this has never been a problem.
A hacky way would be to launch a subprocess

dask-worker <scheduler-address>

(more exactly, the last line of cluster.job_script()) with client.run_on_scheduler.
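For future readers, a rough, untested sketch of that idea (my interpretation of the suggestion above, not a verified recipe; it assumes dask-worker is on the PATH of the node running the scheduler and that a GPU is visible there):

import shlex
import subprocess

# Launch an extra dask-worker as a child process of the scheduler, so it runs
# on the scheduler's node (useful when that node is the one with the GPU).
def launch_local_worker(dask_scheduler=None):
    # distributed injects the Scheduler object into `dask_scheduler`
    cmd = (
        f"dask-worker {dask_scheduler.address} "
        "--nthreads 1 --memory-limit 20GB --name gpu-worker-on-scheduler"
    )
    subprocess.Popen(shlex.split(cmd))
    return cmd

# `client` is the same Client(cluster) from the MWE above
print(client.run_on_scheduler(launch_local_worker))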
Exactly. Let's close this; at least someone in the future will know this is a challenge, or at least a potential challenge, on managed clusters. I've talked to the admins here and they came up with a convoluted set of steps for invoking modules from Python back to SLURM. My take-away is that if you have CPU-only nodes, it might be best to just launch from the GPU node and take that hit, then try to create a worker on that same node if memory allows. Good to know for others. Thanks for your thoughts, I appreciate it.