Fail to run tune with tensorflow and gpu
System information
- OS Platform and Distribution: Linux Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: 0.6.5
- Python version: 3.6
Describe the problem
I am trying to use Ray with TensorFlow, following the tutorial (link), and I get a Tune error.
Error log:
Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
- train_gan_0_partition=0: ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
- train_gan_1_partition=1: ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/1 GPUs
Memory usage on this node: 53.0/67.5 GB
Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
- train_gan_0_partition=0: ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
- train_gan_1_partition=1: ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt
Traceback (most recent call last):
  File "train.py", line 142, in <module>
    **gan_spec)
  File "/lib/python3.6/site-packages/ray/tune/tune.py", line 253, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_gan_0_partition=0, train_gan_1_partition=1])
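The TuneError above only reports that the trials failed; the underlying exception for each trial should be in the error_*.txt files listed under "ERROR trials". Below is a minimal sketch for printing them. It assumes the result logdir shown in the log (ray_results/tune_gan_test) and the one-directory-per-trial layout visible in the listed paths; collect_trial_errors is a hypothetical helper name, not part of Ray.

```python
import glob
import os


def collect_trial_errors(logdir):
    """Return {path: contents} for every error_*.txt one level below logdir."""
    # Assumed layout (taken from the log): <logdir>/<trial_dir>/error_<timestamp>.txt
    pattern = os.path.join(logdir, "*", "error_*.txt")
    errors = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            errors[path] = f.read()
    return errors


if __name__ == "__main__":
    # Result logdir as printed in the status output above.
    for path, text in collect_trial_errors("ray_results/tune_gan_test").items():
        print("=====", path)
        print(text)
```

Posting the contents of these files usually makes the real failure (for example an import error or a CUDA problem) much easier to diagnose than the summary traceback.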
Source code / logs
The code related to ray use:
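# Imports needed by this excerpt (main is defined elsewhere in train.py).
import os

import ray
import tensorflow as tf
from ray import tune
from ray.tune import grid_search, register_trainable


# !!! Entrypoint for ray.tune !!!
def train(config={'partition': 0}, reporter=None):
    # Hand the Tune reporter and the partition index to main() via globals.
    global status_reporter, partition_fn
    status_reporter = reporter
    partition_fn = config['partition']
    tf.app.run(main=main)


# !!! Example of using the ray.tune Python API !!!
if __name__ == "__main__":
    try:
        register_trainable('train_gan', train)
        gan_spec = {
            'stop': {
                'time_total_s': 600,
            },
            'config': {
                'partition': grid_search([0, 1]),
            },
        }
        ray.init()
        tune.run('train_gan',
                 name='tune_gan_test',
                 resources_per_trial={"gpu": 1},
                 raise_on_failed_trial=True,
                 queue_trials=True,
                 with_server=False,
                 **gan_spec)
    except KeyboardInterrupt:
        # os._exit terminates the process; the original os._exists(1) did not.
        os._exit(1)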
How can I fix this? Thanks for your help :)
Issue Analytics
- State:
- Created 4 years ago
- Comments:12 (4 by maintainers)
Top GitHub Comments
Hm, can you check that status_reporter is not None in the main file? You can do this by setting verbose=2 in tune.run and then printing the arguments in the training loop.

Please check whether you are using the GPU version of PyTorch or TensorFlow.
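Both checks suggested above can be gathered in one place. This is only a rough sketch: describe_trial_environment is a hypothetical helper, and it assumes that Ray sets CUDA_VISIBLE_DEVICES for workers that were assigned a GPU and that the installed TensorFlow exposes tf.test.is_built_with_cuda().

```python
import os


def describe_trial_environment(reporter):
    """Collect facts that may explain why a GPU trial fails."""
    info = {
        # Tune is expected to pass a reporter to the trainable function.
        "reporter_is_none": reporter is None,
        # Assumption: Ray sets this for workers that were assigned GPUs.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable in this worker at all.
        info["tensorflow_built_with_cuda"] = None
    else:
        # False means a CPU-only TensorFlow build is installed.
        info["tensorflow_built_with_cuda"] = tf.test.is_built_with_cuda()
    return info
```

Calling print(describe_trial_environment(reporter)) as the first statement of train(config, reporter) would show, per trial, whether the reporter arrived and whether a GPU build of TensorFlow is in use.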