
SLURMCluster worker does not find available tensorflow GPU


Description

Dask workers running on GPU-enabled nodes only run TensorFlow on the CPU. Running the same submission script outside of Dask uses the GPU.

Expected behavior

Dask workers should see the available GPU devices when importing TensorFlow.

Versions

>>> distributed.__version__
'2.9.1'
>>> tf.__version__
'1.14.0'
>>> dask_jobqueue.__version__
'0.7.0'

MWE

Consider the following reproducible-ish example (as much as is possible on a SLURMCluster):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# extra SLURM directives for each worker job
extra_args = [
    "--error=/home/b.weinstein/logs/dask-worker-%j.err",
    "--account=ewhite",
    "--output=/home/b.weinstein/logs/dask-worker-%j.out",
    "--partition=gpu",
    "--gpus=1"
]

cluster = SLURMCluster(
    processes=1,
    cores=1,
    memory="20GB",
    walltime='24:00:00',
    job_extra=extra_args,
    local_directory="/orange/ewhite/b.weinstein/NEON/logs/dask/",
    death_timeout=300)

print(cluster.job_script())
cluster.scale(2)    

client = Client(cluster)

# check whether TensorFlow sees a GPU
def available():
    import tensorflow as tf    
    return tf.test.is_gpu_available()

# list the devices visible to TensorFlow
def devices():
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()

# submit the device-listing task to a worker
future = client.submit(devices)
print(future.result())

# submit the GPU-availability check to a worker
future = client.submit(available)
print(future.result())

This returns:

>>> print(future.result())
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/apps/tensorflow/1.14.0py3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3139368017992869205
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 9545327502143081909
physical_device_desc: "device: XLA_CPU device"
]
>>>
... #submit
... future = client.submit(available)
>>> print(future.result())
False

CPU only.
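
One way to narrow this down (not shown in the original report) is to check what GPU environment the worker processes actually inherit. A minimal diagnostic sketch, assuming the client from the MWE above is still connected:

import os

def gpu_env():
    # SLURM normally exports CUDA_VISIBLE_DEVICES for jobs that were
    # allocated a GPU; if a worker sees no value (or an empty one),
    # TensorFlow will fall back to the CPU.
    return {
        "hostname": os.uname().nodename,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }

# client.run executes the function on every connected worker
print(client.run(gpu_env))

If the workers report the expected GPU-node hostnames but no CUDA_VISIBLE_DEVICES, the problem is the environment the worker process starts with rather than TensorFlow itself.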

To show that the GPU is there, now consider the job script printed by the client:

>>> print(cluster.job_script())
#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=19G
#SBATCH -t 24:00:00
#SBATCH --error=/home/b.weinstein/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/home/b.weinstein/logs/dask-worker-%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=1

JOB_ID=${SLURM_JOB_ID%;*}

/apps/tensorflow/1.14.0py3/bin/python -m distributed.cli.dask_worker tcp://10.13.55.18:44006 --nthreads 1 --memory-limit 20.00GB --name name --nanny --death-timeout 300 --local-directory /orange/ewhite/b.weinstein/NEON/logs/dask/

Take that SLURM script and run it on its own:

#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=19G
#SBATCH -t 24:00:00
#SBATCH --error=/home/b.weinstein/logs/dask-worker-%j.err
#SBATCH --account=ewhite
#SBATCH --output=/home/b.weinstein/logs/dask-worker-%j.out
#SBATCH --partition=gpu
#SBATCH --gpus=1

and then run the test:

#available
def available():
    import tensorflow as tf    
    return tf.test.is_gpu_available()

#list devices
def devices():
    from tensorflow.python.client import device_lib
    return device_lib.list_local_devices()

print(available())

print(devices())

This yields the expected behavior:

[b.weinstein@login3 NEON_crown_maps]$ cat ~/logs/dask-worker-46459347.out
True
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16358656244715553424
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14366244434418621112
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 10812866560
locality {
  bus_id: 1
  links {
  }
}
incarnation: 11711177448480902725
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1a:00.0, compute capability: 7.5"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 13429929682801714225
physical_device_desc: "device: XLA_GPU device"
]

GPU is available.

My interpretation is that something about the scheduler (because it's on a CPU node?) is not invoking TensorFlow as expected. @TomAugspurger is this related to why dask-tensorflow was split off? Is there something different here?

From https://docs.dask.org/en/latest/gpu.html:

Dask doesn’t need to know that these functions use GPUs. It just runs Python functions. Whether or not those Python functions use a GPU is orthogonal to Dask. It will work regardless.

I was expecting this to be a common use case. Thanks for any feedback!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
lesteve commented, Jan 30, 2020

Is it possible to add a worker to dask-jobqueue that is on the same node as the scheduler? It seems a huge waste to reserve a whole GPU node for the scheduler.

This is something I am wondering about as well: in my SLURM cluster, I can only get GPU nodes through SLURM jobs, so if I want my Dask scheduler in a SLURM job, I also want to use the GPUs on that node. Note that for now, my Dask scheduler lives on the login node. Since the Dask scheduler does not consume many resources, this has never been a problem.

A hacky way would be to launch a subprocess running dask-worker <scheduler-address> (more precisely, the last line of cluster.job_script()) with client.run_on_scheduler, as sketched below.
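
A rough, untested sketch of that hack; the worker options are assumptions copied from the job script above, and the local directory is a placeholder:

import subprocess

def start_worker_on_scheduler_node(dask_scheduler=None):
    # run_on_scheduler injects the live Scheduler object as `dask_scheduler`
    cmd = [
        "dask-worker", dask_scheduler.address,
        "--nthreads", "1",
        "--memory-limit", "20GB",      # assumption: mirrors the job script
        "--death-timeout", "300",
        "--local-directory", "/tmp",   # placeholder: pick a writable path
    ]
    subprocess.Popen(cmd)
    return dask_scheduler.address

client.run_on_scheduler(start_worker_on_scheduler_node)

The extra worker then runs on the same node, and inside the same GPU allocation, as the scheduler.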

0 reactions
bw4sz commented, Jan 30, 2020

Exactly. Let's close this; at least someone in the future will know this is a challenge, or at least a potential challenge, on managed clusters. I've talked to the admins here and they came up with a convoluted way of invoking modules from Python back to SLURM. My takeaway is that if you have CPU-only nodes, it might be simplest to just launch from the GPU node and take that hit, then try to create a worker on that same node if memory allows. Good to know for others. Thanks for your thoughts, I appreciate it.
