
[tune] Fail to run tune with pytorch

See original GitHub issue

System information

  • **OS Platform and Distribution**: Linux Ubuntu 16.04
  • **Ray installed from (source or binary)**: binary
  • **Ray version**: 0.6.5
  • **Python version**: 3.6

Describe the problem

I am trying to use Ray with PyTorch, following the bayesopt_example.py example provided by Tune. Note that bayesopt_example.py itself runs successfully. I used the function-based API, and the reporter is called within my function.

But I got a Tune error:

Source code

def trainable_main(config, reporter):
    args = arg_parse()
    vars(args).update(config)
    # ... (run my model)
    reporter(neg_mean_loss=neg_mean_loss)

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()
    ray.shutdown()
    ray.init()

    space = {"lr": (0.0001, 0.01), "dropout": (0.1, 0.9)}

    config = {
        "num_samples": 10 if args.smoke_test else 10,
        "stop": {
            "training_iteration": 2
        },
        "resources_per_trial": {
            "cpu": 0,
            "gpu": 1
        }
    }
    algo = BayesOptSearch(
        space,
        max_concurrent=4,
        reward_attr="neg_mean_loss",
        utility_kwargs={
            "kind": "ucb",
            "kappa": 2.5,
            "xi": 0.0
        })
    scheduler = AsyncHyperBandScheduler(reward_attr="neg_mean_loss")
    run(trainable_main,
        name="my_exp",
        search_alg=algo,
        scheduler=scheduler,
        **config)

Error log

2019-05-15 10:31:53,733 WARNING worker.py:1337 -- WARNING: Not updating worker name since setproctitle is not installed. Install this with pip install setproctitle (or ray[debug]) to enable monitoring of worker processes.
2019-05-15 10:31:53,734 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-15_10-31-53_2668/logs.
2019-05-15 10:31:53,889 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:41813 to respond...
2019-05-15 10:31:54,079 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:22652 to respond...
2019-05-15 10:31:54,087 INFO services.py:804 -- Starting Redis shard with 6.74 GB max memory.
2019-05-15 10:31:54,230 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-15_10-31-53_2668/logs.
2019-05-15 10:31:54,231 INFO services.py:1427 -- Starting the Plasma object store with 10.11 GB memory using /dev/shm.
2019-05-15 10:31:54,690 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-05-15 10:31:54,692 INFO tune.py:211 -- Starting a new experiment.
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 90.000: None | Iter 30.000: None | Iter 10.000: None
Bracket: Iter 90.000: None | Iter 30.000: None
Bracket: Iter 90.000: None
Resources requested: 0/8 CPUs, 0/1 GPUs
Memory usage on this node: 12.7/33.7 GB

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 90.000: None | Iter 30.000: None | Iter 10.000: None
Bracket: Iter 90.000: None | Iter 30.000: None
Bracket: Iter 90.000: None
Resources requested: 0/8 CPUs, 1/1 GPUs
Memory usage on this node: 12.7/33.7 GB
Result logdir: /home/qinjian/ray_results/my_exp
Number of trials: 4 ({'RUNNING': 1, 'PENDING': 3})
PENDING trials:
 - trainable_main_2_dropout=0.10009,lr=0.0030931: PENDING
 - trainable_main_3_dropout=0.2174,lr=0.0010142: PENDING
 - trainable_main_4_dropout=0.24901,lr=0.0035211: PENDING
RUNNING trials:
 - trainable_main_1_dropout=0.43362,lr=0.0072312: RUNNING

(pid=10879) usage: default_worker.py [-h] [--dataset DATASET] [--bmname BMNAME]
(pid=10879) [--pkl PKL_FNAME] [--assign-ratio ASSIGN_RATIO]
(pid=10879) [--num-pool NUM_POOL] [--linkpred]
(pid=10879) [--datadir DATADIR] [--logdir LOGDIR] [--cuda CUDA]
(pid=10879) [--max-nodes MAX_NODES] [--lr LR] [--clip CLIP]
(pid=10879) [--batch-size BATCH_SIZE] [--epochs NUM_EPOCHS]
(pid=10879) [--train-ratio TRAIN_RATIO]
(pid=10879) [--num_workers NUM_WORKERS] [--feature FEATURE_TYPE]
(pid=10879) [--input-dim INPUT_DIM] [--hidden-dim HIDDEN_DIM]
(pid=10879) [--output-dim OUTPUT_DIM] [--num-classes NUM_CLASSES]
(pid=10879) [--num-gc-layers NUM_GC_LAYERS] [--nobn]
(pid=10879) [--dropout DROPOUT] [--nobias] [--method METHOD]
(pid=10879) [--name-suffix NAME_SUFFIX]
(pid=10879) default_worker.py: error: unrecognized arguments: --node-ip-address=172.20.10.7 --object-store-name=/tmp/ray/session_2019-05-15_10-31-53_2668/sockets/plasma_store --raylet-name=/tmp/ray/session_2019-05-15_10-31-53_2668/sockets/raylet --redis-address=172.20.10.7:41813 --temp-dir=/tmp/ray/session_2019-05-15_10-31-53_2668
2019-05-15 10:31:57,573 ERROR trial_runner.py:494 -- Error processing event.
Traceback (most recent call last):
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 443, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 315, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2193, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=10879, host=qinjian-MS-7A72)
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/function_runner.py", line 203, in _train
    ("Wrapped function ran until completion without reporting "
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
richardliaw commented, May 15, 2019

Can you move the arg_parse() out of the function?

args = arg_parse()

def trainable_main(config, reporter):
    vars(args).update(config)
    # ... (run my model)
    reporter(neg_mean_loss=neg_mean_loss)
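
For context, the log above shows why this helps: inside the Ray worker process, sys.argv contains Ray's own internal flags (--node-ip-address, --redis-address, and so on), so calling arg_parse() inside the trainable makes argparse print the usage message and abort the function before reporter is ever called, which is exactly the "ran until completion without reporting results" error. If parsing has to stay inside the function, a minimal alternative sketch (not from the original thread; arg_parse below is a hypothetical stand-in for the user's parser) is to parse an explicit argument list so sys.argv is never read:

import argparse

def arg_parse(argv=None):
    # Hypothetical stand-in for the user's parser; the real one defines the
    # flags shown in the usage message above (--lr, --dropout, ...).
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--dropout", type=float, default=0.5)
    return parser.parse_args(argv)

def trainable_main(config, reporter):
    args = arg_parse([])           # parse defaults only; ignore the worker's sys.argv
    vars(args).update(config)      # then overwrite defaults with Tune's sampled values
    neg_mean_loss = -0.0           # placeholder for the real training result
    reporter(neg_mean_loss=neg_mean_loss)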
0 reactions
richardliaw commented, Jul 2, 2019

Closing this issue because it looks like the error is a BayesOpt error: KeyError: 'Data point [9.e-01 1.e-04] is not unique'.
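
For context, that KeyError is raised by the underlying bayesian-optimization package, which refuses to register the same parameter point twice. A minimal sketch reproducing it with bayes_opt directly (assuming the 1.x API that BayesOptSearch wraps in this Ray version):

from bayes_opt import BayesianOptimization

optimizer = BayesianOptimization(
    f=None,                 # observations will be registered manually
    pbounds={"lr": (0.0001, 0.01), "dropout": (0.1, 0.9)},
    random_state=1,
)

point = {"lr": 0.0001, "dropout": 0.9}
optimizer.register(params=point, target=-0.5)   # first observation: fine
optimizer.register(params=point, target=-0.4)   # same point again -> KeyError: "... is not unique"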


Top Results From Across the Web

How to use Tune with PyTorch - the Ray documentation
In this walkthrough, we will show you how to integrate Tune into your PyTorch training workflow. We will follow this tutorial from the...

Pytorch and ray tune: why the error; raise TuneError("Trials did ...
Could someone show me where I'm going wrong, how to I run HPO with tune in this network and then train the model...

Hyperparameter tuning an LSTM - PyTorch Forums
It seems the code is failing with: TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to...

Logging scalars when running ray[tune] tuning fails in a guild ...
So I went into the ray.tune library and I changed the call to self._trial_writer[trial].add_scalar(full_attr, value, step) and reran it. The failure went away....

Hyperparameter Tuning with PyTorch and Ray Tune
In this tutorial, we are going to explore hyperparameter tuning using PyTorch and Ray Tune and try to obtain the best neural network...
