Fail to run tune with tensorflow and gpu

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.5
  • Python version: 3.6

Describe the problem

I am trying to use Ray with TensorFlow, following the tutorial (link), and I got a Tune error:

error log


Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - train_gan_0_partition=0:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
 - train_gan_1_partition=1:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/1 GPUs
Memory usage on this node: 53.0/67.5 GB
Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - train_gan_0_partition=0:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
 - train_gan_1_partition=1:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    **gan_spec)
  File "/lib/python3.6/site-packages/ray/tune/tune.py", line 253, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_gan_0_partition=0, train_gan_1_partition=1])
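
The status output above only names the per-trial error files; the exception raised inside each trial is written to those `error_*.txt` files. Below is a minimal sketch for printing them, assuming the layout shown in the log (one directory per trial under the result logdir). The helper name `collect_trial_errors` is hypothetical and not part of Ray.

```python
import glob
import os


def collect_trial_errors(logdir):
    """Return {error_file_path: file_contents} for every trial under logdir."""
    # Each trial gets its own subdirectory; failed trials contain error_*.txt.
    pattern = os.path.join(logdir, "*", "error_*.txt")
    errors = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "r") as handle:
            errors[path] = handle.read()
    return errors


if __name__ == "__main__":
    # Result logdir taken from the log output above.
    for path, text in collect_trial_errors("ray_results/tune_gan_test").items():
        print("=== {} ===".format(path))
        print(text)
```

The traceback stored in those files usually says more about the failure than the generic `TuneError` raised by `tune.run`.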

Source code / logs

The code related to ray use:

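import os

import ray
import tensorflow as tf
from ray import tune
from ray.tune import grid_search, register_trainable

# !!! Entrypoint for ray.tune !!!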
def train(config={'partition': 0}, reporter=None):
    global status_reporter, partition_fn
    status_reporter = reporter
    partition_fn = config['partition']
    tf.app.run(main=main)


# !!! Example of using the ray.tune Python API !!!
if __name__ == "__main__":
    try:
        register_trainable('train_gan', train)
        gan_spec = {
            'stop': {
                'time_total_s': 600,
            },
            'config': {
                'partition': grid_search([0, 1]),
            },
        }

        ray.init()

        tune.run('train_gan',
                 name='tune_gan_test',
                 resources_per_trial={"gpu":1},
                 raise_on_failed_trial=True,
                 queue_trials=True,
                 with_server=False,
                 **gan_spec)

    except KeyboardInterrupt:
        # Exit immediately on Ctrl+C; os._exit skips normal cleanup handlers.
        os._exit(1)

How can I fix this? Thanks for your help :)

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

1 reaction
richardliaw commented, Apr 7, 2019

Hm, can you check that status_reporter is not None in the main file? You can do this by setting verbose=2 in tune.run and then printing the arguments in the training loop.
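
The check suggested in this comment could look roughly like the sketch below. It is not the reporter's actual code: `train` mirrors the entrypoint from the issue with the training body omitted, and `run_with_verbose_output` is a hypothetical wrapper that repeats the `tune.run` call from the issue with `verbose=2` added.

```python
def train(config, reporter):
    """Tune entrypoint that prints its arguments before doing any work."""
    # Printing the arguments shows what Tune actually passed to the trial.
    print("config received by trial:", config)
    print("reporter received by trial:", reporter)
    if reporter is None:
        raise ValueError("reporter is None; no status reporter was passed in")
    # ... the original training code (tf.app.run(main=main)) would go here ...


def run_with_verbose_output():
    """Same tune.run call as in the issue, with verbose=2 added."""
    # Imported lazily so that `train` can be exercised without Ray installed.
    import ray
    from ray import tune

    ray.init()
    tune.register_trainable("train_gan", train)
    tune.run(
        "train_gan",
        name="tune_gan_test",
        resources_per_trial={"gpu": 1},
        stop={"time_total_s": 600},
        config={"partition": tune.grid_search([0, 1])},
        verbose=2,
    )
```

Calling `train({"partition": 0}, None)` directly, outside Tune, raises the `ValueError`, which is a quick way to confirm the check itself works before launching trials.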

0 reactions
mahuangxu commented, Aug 28, 2021

Please check whether you are using the GPU version of PyTorch or TensorFlow.
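
One way to run that check for TensorFlow is sketched below. The helper name `gpu_visibility_report` is hypothetical. It relies on two assumptions: that `tf.test.is_built_with_cuda()` is available in the installed TensorFlow, and that Ray exposes assigned GPUs to a trial through the `CUDA_VISIBLE_DEVICES` environment variable. Calling it from inside the trainable shows what the trial process sees.

```python
import os


def gpu_visibility_report(environ=None):
    """Summarize GPU-related settings visible to the current process."""
    env = os.environ if environ is None else environ
    report = {"CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES")}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        report["tensorflow"] = None
        return report
    report["tensorflow"] = tf.__version__
    # False here means a CPU-only TensorFlow build is installed.
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    return report


if __name__ == "__main__":
    print(gpu_visibility_report())
```

If `built_with_cuda` is `False`, or `CUDA_VISIBLE_DEVICES` is empty inside a trial that requested a GPU, TensorFlow will not be able to place operations on the GPU.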
