[tune] Fail to run tune with pytorch
System information
- **OS Platform and Distribution**: Linux Ubuntu 16.04
- **Ray installed from (source or binary)**: binary
- **Ray version**: 0.6.5
- **Python version**: 3.6
Describe the problem
I am trying to use Ray with PyTorch, following the bayesopt_example.py example provided by Tune. Note that bayesopt_example.py itself runs successfully. I used the function-based API, and the reporter is called inside my function.
However, I got a Tune error:
Source code
```python
import argparse

import ray
from ray.tune import run
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch

# NOTE: imports reconstructed from Tune's bayesopt_example.py; they were not
# included in the original report.


def trainable_main(config, reporter):
    args = arg_parse()
    vars(args).update(config)
    # ...(run my model)
    reporter(neg_mean_loss=neg_mean_loss)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args, _ = parser.parse_known_args()

    ray.shutdown()
    ray.init()

    space = {"lr": (0.0001, 0.01), "dropout": (0.1, 0.9)}
    config = {
        "num_samples": 10 if args.smoke_test else 10,
        "stop": {
            "training_iteration": 2
        },
        "resources_per_trial": {
            "cpu": 0,
            "gpu": 1
        }
    }
    algo = BayesOptSearch(
        space,
        max_concurrent=4,
        reward_attr="neg_mean_loss",
        utility_kwargs={
            "kind": "ucb",
            "kappa": 2.5,
            "xi": 0.0
        })
    scheduler = AsyncHyperBandScheduler(reward_attr="neg_mean_loss")
    run(trainable_main,
        name="my_exp",
        search_alg=algo,
        scheduler=scheduler,
        **config)
```
Error log
```
2019-05-15 10:31:53,733 WARNING worker.py:1337 -- WARNING: Not updating worker name since setproctitle is not installed. Install this with pip install setproctitle (or ray[debug]) to enable monitoring of worker processes.
2019-05-15 10:31:53,734 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-15_10-31-53_2668/logs.
2019-05-15 10:31:53,889 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:41813 to respond...
2019-05-15 10:31:54,079 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:22652 to respond...
2019-05-15 10:31:54,087 INFO services.py:804 -- Starting Redis shard with 6.74 GB max memory.
2019-05-15 10:31:54,230 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-15_10-31-53_2668/logs.
2019-05-15 10:31:54,231 INFO services.py:1427 -- Starting the Plasma object store with 10.11 GB memory using /dev/shm.
2019-05-15 10:31:54,690 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-05-15 10:31:54,692 INFO tune.py:211 -- Starting a new experiment.

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 90.000: None | Iter 30.000: None | Iter 10.000: None
Bracket: Iter 90.000: None | Iter 30.000: None
Bracket: Iter 90.000: None
Resources requested: 0/8 CPUs, 0/1 GPUs
Memory usage on this node: 12.7/33.7 GB

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 90.000: None | Iter 30.000: None | Iter 10.000: None
Bracket: Iter 90.000: None | Iter 30.000: None
Bracket: Iter 90.000: None
Resources requested: 0/8 CPUs, 1/1 GPUs
Memory usage on this node: 12.7/33.7 GB
Result logdir: /home/qinjian/ray_results/my_exp
Number of trials: 4 ({'RUNNING': 1, 'PENDING': 3})
PENDING trials:
 - trainable_main_2_dropout=0.10009,lr=0.0030931: PENDING
 - trainable_main_3_dropout=0.2174,lr=0.0010142: PENDING
 - trainable_main_4_dropout=0.24901,lr=0.0035211: PENDING
RUNNING trials:
 - trainable_main_1_dropout=0.43362,lr=0.0072312: RUNNING

(pid=10879) usage: default_worker.py [-h] [--dataset DATASET] [--bmname BMNAME]
(pid=10879)   [--pkl PKL_FNAME] [--assign-ratio ASSIGN_RATIO]
(pid=10879)   [--num-pool NUM_POOL] [--linkpred]
(pid=10879)   [--datadir DATADIR] [--logdir LOGDIR] [--cuda CUDA]
(pid=10879)   [--max-nodes MAX_NODES] [--lr LR] [--clip CLIP]
(pid=10879)   [--batch-size BATCH_SIZE] [--epochs NUM_EPOCHS]
(pid=10879)   [--train-ratio TRAIN_RATIO]
(pid=10879)   [--num_workers NUM_WORKERS] [--feature FEATURE_TYPE]
(pid=10879)   [--input-dim INPUT_DIM] [--hidden-dim HIDDEN_DIM]
(pid=10879)   [--output-dim OUTPUT_DIM] [--num-classes NUM_CLASSES]
(pid=10879)   [--num-gc-layers NUM_GC_LAYERS] [--nobn]
(pid=10879)   [--dropout DROPOUT] [--nobias] [--method METHOD]
(pid=10879)   [--name-suffix NAME_SUFFIX]
(pid=10879) default_worker.py: error: unrecognized arguments: --node-ip-address=172.20.10.7 --object-store-name=/tmp/ray/session_2019-05-15_10-31-53_2668/sockets/plasma_store --raylet-name=/tmp/ray/session_2019-05-15_10-31-53_2668/sockets/raylet --redis-address=172.20.10.7:41813 --temp-dir=/tmp/ray/session_2019-05-15_10-31-53_2668

2019-05-15 10:31:57,573 ERROR trial_runner.py:494 -- Error processing event.
Traceback (most recent call last):
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 443, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 315, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2193, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=10879, host=qinjian-MS-7A72)
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/qinjian/anaconda3/lib/python3.6/site-packages/ray/tune/function_runner.py", line 203, in _train
    ("Wrapped function ran until completion without reporting "
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.
```
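The usage text in the log is printed by the user's own parser running inside the Ray worker: in that process, `sys.argv` is `default_worker.py`'s command line, so a strict `parse_args()` call rejects Ray's `--node-ip-address`/`--redis-address` flags and the reporter is never called, which is consistent with the final `TuneError`. Below is a minimal sketch of the difference between strict and tolerant parsing; the `--lr` flag is illustrative and not taken from the original code.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)

# Inside a Ray worker, sys.argv looks roughly like
#   ['default_worker.py', '--node-ip-address=...', '--redis-address=...', ...]
# so a strict parse would print the usage text and raise SystemExit:
#   args = parser.parse_args()
# parse_known_args() ignores the unrecognized flags instead:
args, unknown = parser.parse_known_args()
print(args.lr, unknown)
```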
Top GitHub Comments
Can you move the `arg_parse()` call out of the function?

Closing this issue because it looks like the error is a BayesOpt error: "KeyError: 'Data point [9.e-01 1.e-04] is not unique'".
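A minimal sketch of that suggestion, assuming the arguments can be parsed once in the driver process; `run_my_model()` is a hypothetical stand-in for the training code that is not shown in the original report, and the search algorithm/scheduler arguments are unchanged and omitted for brevity.

```python
import ray
from ray.tune import run


def build_trainable(args):
    # Capture the already-parsed args in a closure so the worker process
    # never runs argparse itself (the worker's sys.argv carries Ray's own
    # --node-ip-address/--redis-address flags and cannot be parsed strictly).
    def trainable_main(config, reporter):
        vars(args).update(config)
        neg_mean_loss = run_my_model(args)  # hypothetical training helper
        reporter(neg_mean_loss=neg_mean_loss)

    return trainable_main


if __name__ == "__main__":
    args = arg_parse()  # parsed once, in the driver process only
    ray.init()
    run(build_trainable(args),
        name="my_exp",
        stop={"training_iteration": 2},
        resources_per_trial={"cpu": 0, "gpu": 1},
        num_samples=10)
```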