
[rllib] What is the proper way to restore checkpoint for fine-tuning / rendering / evaluation of a trained agent based on example/multiagent_cartpole.py?

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): pip install ray
  • Ray version: 0.6.5
  • Python version: 3.6.2
  • Exact command to reproduce:

Describe the problem

Before my question, let me describe my understanding of the checkpoint file layout. (You can skip ahead to my question.)

The code in example/multiagent_cartpole.py produces an experiment_state-2019-04-03_00-47-28.json-like file and a directory PPO_experiment_name with a few .pkl, .json, and .csv files in it.

The file system looks like:

- local_dir (say: "~/ray_results")
    - exp_name (say: "PPO")
        - checkpoints (say: experiment_state-2019-04-05_17-59-00.json)
        - directory (named like: PPO_cartpole_0_2019-04-05_18-28-0296h2tknq)
            - xxx.log
            - params.json
            - params.pkl (This is the file that stores the trained parameters, I guess?)
            - progress.csv
            - result.json

After one successful training run, we have a trained agent (because I used one shared policy for all agents). We set local_dir to exactly the same value as during training, and exp_name to the same value too, namely PPO.
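
(For reference, a rough, hypothetical sketch of how such a run could be launched so that it produces the layout above; this is not the actual example script, and the config and stop values are placeholders.)

# Hypothetical launch sketch, not the real example/multiagent_cartpole.py code.
# It assumes a Tune version where tune.run accepts name and local_dir, and uses
# a placeholder single-agent config instead of the multi-agent one.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",                             # trainable name; becomes the trial dir prefix
    name="PPO",                        # exp_name: creates ~/ray_results/PPO/
    local_dir="~/ray_results",         # root of the layout shown above
    stop={"training_iteration": 300},  # placeholder stopping criterion
    config={"env": "CartPole-v0"},     # placeholder config
)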

Now for my problem. The tune.run function takes two arguments that look helpful for restoring.

“resume” argument

The resume argument, once set to True, automatically searches local_dir/exp_name/ for the most recent experiment_state-<date_time>.json.

resume mostly works: after setting it to True, the restore itself seems to succeed, but the program terminates immediately, as if it inherits the terminated state from the checkpoint.
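
For clarity, a minimal sketch of what this resume attempt looks like (the config is a placeholder; name and local_dir must match the original run):

# Hypothetical resume sketch: resume=True makes Tune look up the most recent
# experiment_state-<date_time>.json under local_dir/name and reload the trials
# recorded there, including their TERMINATED status. The config is a placeholder.
from ray import tune

tune.run(
    "PPO",
    name="PPO",                 # same exp_name as the original run
    local_dir="~/ray_results",  # same local_dir as the original run
    resume=True,                # no way to point this at a specific checkpoint file
    config={"env": "CartPole-v0"},
)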

Here’s the log:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/1 GPUs
Memory usage on this node: 4.3/16.7 GB
Result logdir: /home/SENSETIME/pengzhenghao/ray_results/PPO
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
 - PPO_tollgate_0:	TERMINATED, [12 CPUs, 1 GPUs], [pid=9214], 4846 s, 300 iter, 1320000 ts, 1.1e+03 rew

The printed reward is exactly what the trained agent is able to achieve, but I cannot continue training this agent, even if I set num_iters greater than the number of iterations of the last training run (namely 300).

What's more, it seems impossible to use the resume argument to point at a specific checkpoint by filename.

In a nutshell, my question about the resume argument is:

  1. What is the purpose of this argument? It seems it is only meant to recover from unexpected failures, so it cannot be used to restore a specific checkpoint. Am I correct?

“restore” argument

After setting restore=<log_dir>, namely restore="./experiments" (which is my log_dir), I get the following error:

Traceback (most recent call last):
  File "xxx/anaconda3/envs/dev/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 499, in restore
    ray.get(trial.runner.restore.remote(value))
  File "xxx/anaconda3/envs/dev/lib/python3.6/site-packages/ray/worker.py", line 2316, in get
    raise value
ray.exceptions.RayTaskError: ray_PPOAgent:restore() (pid=28099, host=g114e1900387)
  File "xxx/anaconda3/envs/dev/lib/python3.6/site-packages/ray/tune/trainable.py", line 304, in restore
    with open(checkpoint_path + ".tune_metadata", "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './experiments.tune_metadata'

I have checked everywhere on this computer and there is no file ending with .tune_metadata. I am really confused.

In short, what I am trying to do is:

  1. Restore the trained agent and continue its training with the same config.

  2. Restore the trained agent, retrieve the policy network, and use it in the same environment with rendering, in order to visualize its performance (a rough sketch of what I mean is shown after this list).

  3. Restore the trained agent as a pre-trained agent, modify the config (for example, use more workers and GPUs), and continue training on a cluster.
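
For item 2, what I have in mind is roughly the following sketch, assuming a newer RLlib that exposes ppo.PPOTrainer and compute_action; the checkpoint path and config are placeholders:

# Hypothetical rendering/evaluation sketch, simplified to a single-agent env.
# Assumes an RLlib version exposing ppo.PPOTrainer and Trainer.compute_action;
# the checkpoint path below is a placeholder for an actual checkpoint-<i> file.
import gym
import ray
from ray.rllib.agents import ppo

ray.init()
agent = ppo.PPOTrainer(config={"env": "CartPole-v0", "num_workers": 0})
agent.restore("<path to checkpoint_300/checkpoint-300>")  # placeholder path

env = gym.make("CartPole-v0")
obs = env.reset()
done = False
while not done:
    env.render()
    action = agent.compute_action(obs)          # uses the restored policy network
    obs, reward, done, info = env.step(action)  # old gym step API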

Could you please tell me what I should do?

(By the way, the documentation is really insufficient for thoroughly understanding the whole RLlib workflow. Nevertheless, I appreciate you guys for this excellent project and hope someday I can make some contribution too.)

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 23
  • Comments: 23 (15 by maintainers)

Top GitHub Comments

31 reactions
pengzhenghao commented on Apr 14, 2019

For future readers:

The resume argument does nothing but continue the last unfinished experiment. In this mode, you are not allowed to reset num_iters.

The restore argument takes the path of a checkpoint file as input. Concretely, the file looks like ~/ray_results/expname/envname_date_someothercodes/checkpoint_10/checkpoint-10. Note that checkpoint files only exist for tune.run() executions with checkpoint_at_end=True or with checkpoint_freq set to a non-zero value.

Using the restore argument with the checkpoint from which you want to continue is the only way to increase the number of iterations of a finished or unfinished experiment.
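
Putting that together, a hypothetical sketch of the workflow (config, stop values, and the checkpoint path are placeholders):

# Hypothetical sketch of the pattern described above; config, stop values and
# the checkpoint path are placeholders.
from ray import tune

# 1) Train and actually write checkpoints; without checkpoint_freq or
#    checkpoint_at_end there is nothing for restore to point at.
tune.run(
    "PPO",
    name="PPO",
    local_dir="~/ray_results",
    stop={"training_iteration": 300},
    checkpoint_freq=10,       # checkpoint every 10 iterations
    checkpoint_at_end=True,   # and once more at the end
    config={"env": "CartPole-v0"},
)

# 2) Continue from a specific checkpoint *file*, e.g.
#    ~/ray_results/PPO/<trial_dir>/checkpoint_300/checkpoint-300,
#    with a larger iteration budget than before.
tune.run(
    "PPO",
    name="PPO_continued",
    local_dir="~/ray_results",
    stop={"training_iteration": 600},
    restore="<path to checkpoint_300/checkpoint-300>",
    config={"env": "CartPole-v0"},
)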

Thanks to Eric for offering quick and kind responses!

10 reactions
stefanbschneider commented on Aug 7, 2020

Since I was searching for a simple way to load a trained agent and continue training with RLlib, and only found this issue, here is what I found and what is, in my opinion, the easiest way:

ray.tune.run(PPOTrainer, config=myconfig, restore=path_to_trained_agent_checkpoint)

I.e., just set the path in the restore argument; that's it! No need for a custom training function.
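
For goal 3 in the original question (continuing from a checkpoint with a modified config), the same idea extends to something like the following sketch; the resource numbers and checkpoint path are placeholders:

# Hypothetical sketch building on the one-liner above, not from the comment
# itself: restore a checkpoint but train with more workers and a GPU.
# The checkpoint path and resource numbers are placeholders.
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
myconfig = {
    "env": "CartPole-v0",
    "num_workers": 8,   # more rollout workers than in the original run
    "num_gpus": 1,      # use a GPU for training this time
}
tune.run(
    PPOTrainer,
    config=myconfig,
    stop={"training_iteration": 600},
    restore="<path to checkpoint_300/checkpoint-300>",
)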
