Fail to run tune with tensorflow and gpu

System information

OS Platform and Distribution: Linux Ubuntu 16.04
Ray installed from (source or binary): binary
Ray version: 0.6.5
Python version: 3.6

Describe the problem

I am trying to use Ray with TensorFlow, following the tutorial (link), and I got a Tune error:

error log


Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - train_gan_0_partition=0:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
 - train_gan_1_partition=1:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/1 GPUs
Memory usage on this node: 53.0/67.5 GB
Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - train_gan_0_partition=0:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
 - train_gan_1_partition=1:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    **gan_spec)
  File "/lib/python3.6/site-packages/ray/tune/tune.py", line 253, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_gan_0_partition=0, train_gan_1_partition=1])
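
The `TuneError` above only says that the trials did not complete; the underlying exception for each trial should be in the `error_*.txt` file listed next to it. A minimal sketch for printing those files, assuming the directory layout shown in the log (one sub-directory per trial under the result logdir). The helper names `collect_trial_errors` and `print_trial_errors` are made up for this example and are not part of Ray:

```python
import glob
import os


def collect_trial_errors(logdir):
    """Return {path: contents} for every error_*.txt one level below logdir."""
    errors = {}
    # Assumed layout: <logdir>/<trial directory>/error_<timestamp>.txt
    pattern = os.path.join(logdir, "*", "error_*.txt")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            errors[path] = f.read()
    return errors


def print_trial_errors(logdir="ray_results/tune_gan_test"):
    """Print each trial's error file under a header with its path."""
    for path, text in collect_trial_errors(logdir).items():
        print("=== " + path + " ===")
        print(text)
```

Calling `print_trial_errors()` from the directory that contains `ray_results` would show the per-trial tracebacks, which are usually more informative than the summary raised by `tune.run`.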

Source code / logs

The code related to ray use:

# !!! Entrypoint for ray.tune !!!
def train(config={'partition': 0}, reporter=None):
    global status_reporter, partition_fn
    status_reporter = reporter
    partition_fn = config['partition']
    tf.app.run(main=main)


# !!! Example of using the ray.tune Python API !!!
if __name__ == "__main__":
    try:
        register_trainable('train_gan', train)
        gan_spec = {
            'stop': {
                'time_total_s': 600,
            },
            'config': {
                'partition': grid_search([0, 1]),
            },
        }

        ray.init()

        tune.run('train_gan',
                 name='tune_gan_test',
                 resources_per_trial={"gpu":1},
                 raise_on_failed_trial=True,
                 queue_trials=True,
                 with_server=False,
                 **gan_spec)

    except KeyboardInterrupt:
        os._exit(1)  # was os._exists(1), which only checks a name and never exits

How could I fix this? Thx for your help : )
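
One thing that may be worth ruling out (an assumption; the thread does not confirm it as the cause): `tf.app.run` calls `sys.exit` once `main` returns, so a trainable that ends with `tf.app.run(main=main)` raises `SystemExit` inside the worker process even when training succeeds. The sketch below reproduces only that behaviour without TensorFlow. `fake_app_run`, `train_via_app_run`, `train_direct` and `exits` are hypothetical names used for illustration:

```python
import sys


def main(argv=None):
    # Placeholder for the real training entry point.
    return 0


def fake_app_run(main):
    # Stand-in that mimics only the exit behaviour of tf.app.run:
    # run main, then pass its return value to sys.exit.
    sys.exit(main(sys.argv[:1]))


def train_via_app_run(config, reporter=None):
    fake_app_run(main)  # raises SystemExit, even on success


def train_direct(config, reporter=None):
    main(sys.argv[:1])  # returns normally


def exits(fn):
    """Return True if calling fn raises SystemExit."""
    try:
        fn({"partition": 0})
    except SystemExit:
        return True
    return False
```

Here `exits(train_via_app_run)` is true and `exits(train_direct)` is false. Whether this is what failed in the trials above can only be confirmed from the per-trial error files.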

Top GitHub Comments

richardliaw commented, Apr 7, 2019

Hm, can you check that `status_reporter` is not None in the main file? You can do this by setting verbose=2 in tune.run and then printing the arguments in the training loop.
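
A minimal sketch of that check, assuming the function-based trainable signature used in the question (`train(config, reporter)`). The early `ValueError` and the `run_with_tune` wrapper are additions for illustration, and the reported metric is a placeholder:

```python
def train(config, reporter=None):
    # Print exactly what was passed in, so it shows up in the trial output.
    print("train() got config=%r reporter=%r" % (config, reporter))
    if reporter is None:
        raise ValueError("reporter is None: train() was not called by Tune")
    partition = config["partition"]
    # ... the real training loop would go here ...
    reporter(timesteps_total=1)  # placeholder metric
    return partition


def run_with_tune():
    # Not executed in this sketch; requires Ray to be installed.
    import ray
    from ray import tune
    from ray.tune import grid_search, register_trainable

    register_trainable("train_gan", train)
    ray.init()
    tune.run("train_gan",
             name="tune_gan_test",
             resources_per_trial={"gpu": 1},
             config={"partition": grid_search([0, 1])},
             verbose=2)  # the verbosity setting suggested in the comment
```

Calling `train` directly with a dummy reporter, for example `train({"partition": 0}, reporter=lambda **kw: None)`, is a quick way to test the function outside Tune before launching trials.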

mahuangxu commented, Aug 28, 2021

Please check whether you are using the GPU version of PyTorch or TensorFlow.
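
For TensorFlow, a check along these lines may help. It is a sketch, not an official diagnostic: `describe_gpu_support` is a hypothetical helper, and it degrades gracefully if TensorFlow cannot be imported. It also reports `CUDA_VISIBLE_DEVICES`, which Ray sets for workers that have been assigned GPUs:

```python
import os


def describe_gpu_support():
    """Collect basic facts about GPU visibility in the current process."""
    info = {
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "tensorflow_imported": False,
        "built_with_cuda": None,
        "gpu_devices": [],
    }
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except Exception as exc:  # a broken install can raise more than ImportError
        info["import_error"] = repr(exc)
        return info

    info["tensorflow_imported"] = True
    # False here means a CPU-only TensorFlow build is installed.
    info["built_with_cuda"] = tf.test.is_built_with_cuda()
    # Names of the GPU devices TensorFlow can actually see.
    info["gpu_devices"] = [d.name for d in device_lib.list_local_devices()
                           if d.device_type == "GPU"]
    return info
```

Printing `describe_gpu_support()` both in the driver script and inside the trainable can be useful, since the two run in different processes and may not see the same devices.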
