SLURMCluster worker does not find available tensorflow GPU
Description
Dask workers running on nodes with available GPUs only run TensorFlow on the CPU. Running the same submission script outside of Dask uses the GPU.
Expected behavior
Dask workers would see available GPU devices when importing tensorflow
Versions
>>> distributed.__version__
'2.9.1'
>>> tf.__version__
'1.14.0'
>>> dask_jobqueue.__version__
'0.7.0'
MWE
Consider the following reproducible-ish example (as much as is possible on a SLURMCluster):
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# job args
extra_args = [
    "--error=/home/b.weinstein/logs/dask-worker-%j.err",
    "--account=ewhite",
    "--output=/home/b.weinstein/logs/dask-worker-%j.out",
    "--partition=gpu",
    "--gpus=1"
]

cluster = SLURMCluster(
    processes=1,
    cores=1,
    memory="20GB",
    walltime='24:00:00',
    job_extra=extra_args,
    local_directory="/orange/ewhite/b.weinstein/NEON/logs/dask/",
    death_timeout=300)
print(cluster.job_script())
cluster.scale(2)
client = Client(cluster)
# available
def available():
    import tensorflow as tf
    return tf.test.is_gpu_available()

# list devices
def devices():
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()

# submit
future = client.submit(devices)
print(future.result())

# submit
future = client.submit(available)
print(future.result())
returns
>>> print(future.result())
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3139368017992869205
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 9545327502143081909
physical_device_desc: "device: XLA_CPU device"
]
>>>
... #submit
... future = client.submit(available)
>>> print(future.result())
False
CPU only.
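One way to narrow this down (not something tried in the original report, just a hedged diagnostic sketch reusing the client from the MWE above) is to ask the worker processes directly what GPU-related environment they see:

import os
import socket

# Runs on every worker; returns one dict per worker, keyed by worker address.
def worker_env():
    return {
        "host": socket.gethostname(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }

print(client.run(worker_env))

If CUDA_VISIBLE_DEVICES is empty or LD_LIBRARY_PATH is missing the CUDA libraries on the workers, that would explain TensorFlow falling back to CPU there.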
To prove the GPU is there, now consider the job script printed by the client:
>>> print(cluster.job_script())
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=19G
#SBATCH -t 24:00:00
#SBATCH --error=/home/b.weinstein/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/home/b.weinstein/logs/dask-worker-%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=1
JOB_ID=${SLURM_JOB_ID%;*}
/apps/tensorflow/1.14.0py3/bin/python -m distributed.cli.dask_worker tcp://10.13.55.18:44006 --nthreads 1 --memory-limit 20.00GB --name name --nanny --death-timeout 300 --local-directory /orange/ewhite/b.weinstein/NEON/logs/dask/
Take that SLURM script and run it alone
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=19G
#SBATCH -t 24:00:00
#SBATCH --error=/home/b.weinstein/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/home/b.weinstein/logs/dask-worker-%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=1
and then run the test
# available
def available():
    import tensorflow as tf
    return tf.test.is_gpu_available()

# list devices
def devices():
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()

print(available())
print(devices())
yields the expected behavior
[b.weinstein@login3 NEON_crown_maps]$ cat ~/logs/dask-worker-46459347.out
True
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16358656244715553424
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14366244434418621112
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 10812866560
locality {
bus_id: 1
links {
}
}
incarnation: 11711177448480902725
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1a:00.0, compute capability: 7.5"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 13429929682801714225
physical_device_desc: "device: XLA_GPU device"
]
GPU is available.
My interpretation is that something about the scheduler (because it's on CPU?) is not invoking TensorFlow as expected. @TomAugspurger, is this related to why dask-tensorflow was split off? Is there something different here?
From https://docs.dask.org/en/latest/gpu.html:
Dask doesn’t need to know that these functions use GPUs. It just runs Python functions. Whether or not those Python functions use a GPU is orthogonal to Dask. It will work regardless.
I was expecting this to be a common use case. Thanks for any feedback!
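One more hedged diagnostic sketch, not something attempted in this report: dask-jobqueue's env_extra parameter injects shell lines into the generated job script before dask-worker starts, so GPU visibility inside the worker job itself gets written to the worker logs. A minimal variant of the cluster above (reusing the same extra_args list and paths):

from dask_jobqueue import SLURMCluster

# Same cluster as in the MWE, plus env_extra lines that run inside the SLURM
# job before dask-worker starts, logging what the allocation can actually see.
cluster = SLURMCluster(
    processes=1,
    cores=1,
    memory="20GB",
    walltime='24:00:00',
    job_extra=extra_args,  # the same extra_args list as in the MWE
    env_extra=[
        "nvidia-smi || true",                                # does the allocation see a GPU?
        "echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES",   # what did SLURM expose?
    ],
    local_directory="/orange/ewhite/b.weinstein/NEON/logs/dask/",
    death_timeout=300,
)
print(cluster.job_script())  # the env_extra lines appear just before the worker command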
Top GitHub Comments
This is something I am wondering about as well: on my SLURM cluster, I can only get GPU nodes through SLURM jobs, so if I want my Dask scheduler in a SLURM job, I also want to use the GPUs on that node. Note that for now, my Dask scheduler lives on the login node. Since the Dask scheduler does not consume many resources, this has never been a problem.
A hacky way would be to launch a subprocess

dask-worker <scheduler-address>

(more exactly, the last line of cluster.job_script()) with client.run_on_scheduler.
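For future readers, a rough, untested sketch of that idea (my interpretation of the suggestion above, not a verified recipe; it assumes dask-worker is on the PATH of the node running the scheduler and that a GPU is visible there):

import shlex
import subprocess

# Launch an extra dask-worker as a child process of the scheduler, so it runs
# on the scheduler's node (useful when that node is the one with the GPU).
def launch_local_worker(dask_scheduler=None):
    # distributed injects the Scheduler object into `dask_scheduler`
    cmd = (
        f"dask-worker {dask_scheduler.address} "
        "--nthreads 1 --memory-limit 20GB --name gpu-worker-on-scheduler"
    )
    subprocess.Popen(shlex.split(cmd))
    return cmd

# `client` is the same Client(cluster) from the MWE above
print(client.run_on_scheduler(launch_local_worker))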
Exactly. Let's close this; at least someone in the future will know this is a challenge, or at least a potential challenge, on managed clusters. I've talked to the admins here and they came up with a convoluted set of steps for invoking modules from Python back to SLURM. My take-away is that if you have CPU-only nodes, it might be best to just launch from the GPU node and take that hit, then try to create a worker on that same node if memory allows. Good to know for others. Thanks for your thoughts, I appreciate it.