[tune] PBT causes task reconstruction messages

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): binary from latest installed 11/7
  • Ray version: 0.5.3
  • Python version: 2.7.15

Describe the problem

Ray logs an excessive number of "Reconstructing task ..." messages to stdout when running PBT. There is relatively little spam early on, and more as training progresses.

An example of this line is: I1111 00:40:52.240785 29181 node_manager.cc:1422] Reconstructing task 00000000b85b590f361da6f18d3143dabf82655a on client 94396e7715cc5f95bfbccf545a7e14591dacc7c1
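To put numbers on the spam, the captured stdout can be scanned for these lines. The following is a minimal sketch, assuming every message follows the shape of the single example above; the function name `count_reconstructions` and the regular expression are illustrative and not part of Ray.

```python
import re
from collections import Counter

# Pattern inferred from the one example line above (an assumption, not a Ray spec).
RECONSTRUCT_RE = re.compile(
    r"Reconstructing task (?P<task>[0-9a-f]+) on client (?P<client>[0-9a-f]+)"
)


def count_reconstructions(lines):
    """Count 'Reconstructing task' messages per client ID."""
    per_client = Counter()
    for line in lines:
        match = RECONSTRUCT_RE.search(line)
        if match:
            per_client[match.group("client")] += 1
    return per_client


sample = (
    "I1111 00:40:52.240785 29181 node_manager.cc:1422] Reconstructing task "
    "00000000b85b590f361da6f18d3143dabf82655a on client "
    "94396e7715cc5f95bfbccf545a7e14591dacc7c1"
)
counts = count_reconstructions([sample, "some unrelated log line", sample])
print(sum(counts.values()), "reconstruction messages")
```

Passing an open log file instead of a list works the same way, since the function only iterates over lines.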

With 4 trials, 200 total epochs, and a perturb interval of 10 iterations:

  • At 0 epochs: 1 line per iteration per trial
  • At 10 epochs, 1 checkpoint, 0 perturb: now 2 lines / iter / trial
  • At 20 epochs: 3-5 lines
  • At 2X epochs: ~10 lines
  • At 190 epochs, 27 checkpoints, 25 perturbs: About 100 lines / iter / trial

The total log size is about 14MB, about 100k total lines.

With 8 trials running the same thing, the log size increases to about 48MB, 345K total lines. Also, for some reason there are 47 checkpoints and 25 perturbs (the 4-trial run had 27 checkpoints and 25 perturbs).
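As a quick sanity check on the reported figures, the arithmetic below derives bytes per line and lines per trial for both runs. It is only a sketch: it assumes 1MB means 10^6 bytes and treats the rounded numbers from the report as exact, and the helper name `summarize` is made up for illustration.

```python
def summarize(total_bytes, total_lines, trials):
    """Derive average line size and per-trial line count from log totals."""
    return {
        "bytes_per_line": total_bytes / total_lines,
        "lines_per_trial": total_lines / trials,
    }


# Figures as reported above; 1MB is assumed to be 10**6 bytes.
four_trials = summarize(14 * 10**6, 100000, 4)
eight_trials = summarize(48 * 10**6, 345000, 8)

print(four_trials)
print(eight_trials)
```

Under these assumptions the average line size is close to 140 bytes in both runs, while the number of lines per trial is about 1.7 times higher with 8 trials than with 4.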

Source code / logs

I can attach logs if they would be helpful, but they are very big.

Saving and restoring is done with a tf.train.Saver object, calling saver.save() and saver.restore().

Configs are:


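# Imports for the Ray/Tune names used below. RayModel, hparams, FLAGS, and
# explore are defined elsewhere in the reporter's script.
import ray
from ray.tune import run_experiments
from ray.tune.schedulers import PopulationBasedTraining

train_spec = {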
    "run": RayModel,
    "trial_resources": {
        "cpu": 8,
        "gpu": 1
    },
    "stop": {
        "training_iteration": hparams.num_epochs,
    },
    "config": hparams.values(),
    "local_dir": FLAGS.local_dir,
    "checkpoint_freq": FLAGS.checkpoint_freq,
    "num_samples": FLAGS.num_samples
}
ray.init()
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    reward_attr="val_acc",
    perturbation_interval=FLAGS.perturbation_interval,
    m_explore_fn=explore)
run_experiments({"autoaug_pbt": train_spec}, scheduler=pbt, verbose=False)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

ericl commented, Nov 18, 2018

Ok I think I know the proximate cause of this:

In tune, when we PAUSE a trial, we call runner.save_to_object.remote(): https://github.com/ray-project/ray/blob/65c27c70cf88094c3facc95a0566aaf4159cb46e/python/ray/tune/ray_trial_executor.py#L282 This is followed by terminating the runner via __ray_terminate__.

Later on, we pass the future returned by save_to_object to restore_from_remote(): https://github.com/ray-project/ray/blob/65c27c70cf88094c3facc95a0566aaf4159cb46e/python/ray/tune/ray_trial_executor.py#L301 However, that seems to somehow trigger the reconstructing message.

The message goes away if you add a ray.get() after save_to_object.remote() (i.e. ray.get(trial._checkpoint.value)), which forces the ray terminate to occur after the method has completed. Though I’m not sure why that should matter, other than there being some bug in the backend about getting items after the actor has terminated, even if the actor did put the object successfully. @stephanie-wang any ideas?

This issue is quite specific though and fairly harmless, so not a release blocker.
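The ordering described in the comment above (block on the save, then terminate) can be sketched as follows. This is an illustration and not the actual Tune code: the function name `pause_runner_synchronously` and the `get_fn` parameter are hypothetical, and `runner` is assumed to be a Ray actor handle that exposes `save_to_object`, as in the linked `ray_trial_executor.py`.

```python
def pause_runner_synchronously(runner, get_fn=None):
    """Save a runner's state and only then ask it to terminate.

    `runner` is assumed to be a Ray actor handle exposing `save_to_object`.
    `get_fn` defaults to `ray.get`; it is a parameter only so the ordering
    can be exercised without a running Ray cluster (hypothetical helper).
    """
    if get_fn is None:
        import ray  # imported lazily so the sketch loads without Ray installed

        get_fn = ray.get

    # Kick off the save on the actor; this returns a future immediately.
    checkpoint_ref = runner.save_to_object.remote()
    # Block until the save has actually completed on the actor.
    get_fn(checkpoint_ref)
    # Only now request termination, so it cannot overtake the save.
    runner.__ray_terminate__.remote()
    return checkpoint_ref
```

The trade-off is that pausing a trial now waits for the save to finish, where the original code returned as soon as the save was scheduled.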

arcelien commented, Nov 18, 2018

I noticed that it doesn’t happen when all trials fit inside compute resources and runs don’t get paused.
