[tune] PBT causes task reconstruction messages
See original GitHub issue
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): binary from latest installed 11/7
- Ray version: 0.5.3
- Python version: 2.7.15
Describe the problem
Getting excessive Ray logging to stdout about "Reconstructing task ..." when running PBT. There is relatively little spam early on, and more as training progresses.
An example of this line is: I1111 00:40:52.240785 29181 node_manager.cc:1422] Reconstructing task 00000000b85b590f361da6f18d3143dabf82655a on client 94396e7715cc5f95bfbccf545a7e14591dacc7c1
With 4 trials, 200 total epochs, and a perturb interval of 10 iterations:
- At 0 epochs: 1 line per iteration per trial
- At 10 epochs (1 checkpoint, 0 perturbs): 2 lines / iter / trial
- At 20 epochs: 3-5 lines
- At 2X epochs: ~10 lines
- At 190 epochs (27 checkpoints, 25 perturbs): about 100 lines / iter / trial
The total log size is about 14 MB, roughly 100k lines.
With 8 trials running the same thing, the size increases to about 48 MB, roughly 345k lines. Also, for some reason there were 47 checkpoints and 25 perturbs (with 4 trials there were 27 checkpoints and 25 perturbs).
Source code / logs
I can attach logs if they would be helpful, but they are very big.
Saving and restoring is done with a tf.Saver object, calling saver.save() and saver.restore().
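For reference, that save/restore wiring goes through the standard Trainable checkpoint hooks. Below is a minimal sketch of that shape; the method bodies are assumptions about the setup rather than the actual training code, and the hook signatures differ slightly across Ray versions:

import os
import tensorflow as tf
from ray.tune import Trainable


class RayModel(Trainable):
    def _setup(self):
        # Build the graph (model construction elided) and create a Saver
        # over its variables.
        self.sess = tf.Session()
        ...
        self.saver = tf.train.Saver()

    def _train(self):
        ...  # run one epoch and return metrics, e.g. including "val_acc"

    def _save(self, checkpoint_dir):
        # saver.save() returns the checkpoint path prefix, which Tune keeps
        # and later passes back into _restore() when the trial resumes.
        return self.saver.save(
            self.sess, os.path.join(checkpoint_dir, "model.ckpt"))

    def _restore(self, checkpoint_path):
        self.saver.restore(self.sess, checkpoint_path)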
Configs are:
train_spec = {
    "run": RayModel,
    "trial_resources": {
        "cpu": 8,
        "gpu": 1
    },
    "stop": {
        "training_iteration": hparams.num_epochs,
    },
    "config": hparams.values(),
    "local_dir": FLAGS.local_dir,
    "checkpoint_freq": FLAGS.checkpoint_freq,
    "num_samples": FLAGS.num_samples
}
ray.init()
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    reward_attr="val_acc",
    perturbation_interval=FLAGS.perturbation_interval,
    custom_explore_fn=explore)
run_experiments({"autoaug_pbt": train_spec}, scheduler=pbt, verbose=False)
Issue Analytics
- State:
- Created 5 years ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
Ok I think I know the proximate cause of this:
In tune, when we PAUSE a trial, we call runner.save_to_object.remote(): https://github.com/ray-project/ray/blob/65c27c70cf88094c3facc95a0566aaf4159cb46e/python/ray/tune/ray_trial_executor.py#L282
This is followed up by __ray_terminate__-ing the runner.
Later on, we pass the future returned by save_to_object to restore_from_remote(): https://github.com/ray-project/ray/blob/65c27c70cf88094c3facc95a0566aaf4159cb46e/python/ray/tune/ray_trial_executor.py#L301
However, that somehow seems to trigger the reconstructing message.
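In rough pseudocode, the sequence described above looks something like the following. This is a hand-wavy sketch, not the actual executor source; the helper function names are made up for illustration, and the exact __ray_terminate__ call varies by Ray version:

def pause_trial(trial):
    # Kick off a save of the runner's state into the object store; this
    # returns a future immediately, without waiting for the save to land.
    trial._checkpoint.value = trial.runner.save_to_object.remote()
    # The runner actor is then terminated straight away.
    trial.runner.__ray_terminate__.remote()

def resume_trial(trial, new_runner):
    # Much later, the saved future is handed to a fresh runner to restore
    # from; this is the step that appears to trigger the
    # "Reconstructing task ..." log lines.
    new_runner.restore_from_object.remote(trial._checkpoint.value)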
The message goes away if you add a ray.get() after save_to_object.remote() (i.e. ray.get(trial._checkpoint.value)), which forces the __ray_terminate__ to occur after the save method has completed. Though I'm not sure why that should matter, other than there being some bug in the backend around getting items after the actor has terminated, even if the actor did put the object successfully. @stephanie-wang any ideas?
This issue is quite specific though and fairly harmless, so not a release blocker.
I noticed that it doesn't happen when all trials fit within the available compute resources and runs never get paused.