
set global_step from checkpoint

See original GitHub issue

❓ Questions/Help/Support

Hello, I wonder how I am supposed to set the global_step from a checkpoint for handlers.

Specifically, I would like most of the handlers in the TensorBoard logger to use engine.state.iteration as the global step. Since engine.state.iteration is reset when I call engine.run, I had to create my own counter (like the following), which simply adds an offset to the current iteration, and pass it to all the global_step_transform arguments of the output handlers.

class global_step_counter:
    """Callable that offsets the engine's iteration count so the global
    step keeps increasing across resumed runs."""

    def __init__(self, resume_from_iter):
        self.resume_from_iter = resume_from_iter

    def set_starting_val(self, new_val):
        self.resume_from_iter = new_val

    def __call__(self, engine):
        # Fresh iteration count plus the checkpointed offset.
        return engine.state.iteration + self.resume_from_iter
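The counter can be sanity-checked without Ignite by stubbing out the engine. Everything below (make_stub_engine and the SimpleNamespace stand-ins) is hypothetical scaffolding for the demo, not Ignite API:

```python
from types import SimpleNamespace

# Hypothetical stand-in for ignite's Engine, exposing only state.iteration.
def make_stub_engine(iteration):
    return SimpleNamespace(state=SimpleNamespace(iteration=iteration))

class global_step_counter:
    def __init__(self, resume_from_iter):
        self.resume_from_iter = resume_from_iter

    def __call__(self, engine):
        # Fresh iteration count plus the checkpointed offset.
        return engine.state.iteration + self.resume_from_iter

counter = global_step_counter(resume_from_iter=8)  # resumed after 8 iterations
engine = make_stub_engine(iteration=3)             # 3 iterations into the new run
print(counter(engine))                             # → 11
```

With an offset of 8 and a fresh iteration of 3, the reported global step is 11, i.e. continuous with the first run's steps 1..8.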

However, I realized the following handlers do not provide a global_step_transform option:

  • OptimizerParamsHandler
  • WeightsScalarHandler
  • GradsScalarHandler
  • WeightsHistHandler
  • GradsHistHandler

Instead, they use get_event_attrib_value to get the global_step, which makes this variable inconsistent with the others.
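One possible workaround (a sketch, not Ignite API) is to wrap such a handler so that, for the duration of the call, engine.state.iteration is shifted by the checkpoint offset; since get_event_attrib_value reads the step from the engine state, the wrapped handler then logs against the offset step. The stub_handler and SimpleNamespace engine below are hypothetical stand-ins used only to demonstrate the wrapper:

```python
from types import SimpleNamespace

class OffsetStep:
    """Wrap a handler that reads its global step from engine.state so it
    sees the checkpointed offset added in (a sketch, not Ignite API)."""

    def __init__(self, handler, offset):
        self.handler = handler
        self.offset = offset

    def __call__(self, engine, *args, **kwargs):
        saved = engine.state.iteration
        engine.state.iteration = saved + self.offset  # shift what the handler reads
        try:
            self.handler(engine, *args, **kwargs)
        finally:
            engine.state.iteration = saved  # restore for the rest of training

# Hypothetical stand-ins to demonstrate the wrapper.
seen_steps = []

def stub_handler(engine, *args, **kwargs):
    seen_steps.append(engine.state.iteration)  # mimics reading the step from state

engine = SimpleNamespace(state=SimpleNamespace(iteration=3))
wrapped = OffsetStep(stub_handler, offset=8)
wrapped(engine)
print(seen_steps, engine.state.iteration)  # → [11] 3
```

The try/finally restores the real iteration count even if the wrapped handler raises, so the engine's own bookkeeping is never corrupted.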

Thank you

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
ghost commented, Mar 3, 2020

Thank you @vfdev-5! This solved my issue.

1 reaction
ghost commented, Mar 3, 2020

Here is a minimal example, @vfdev-5:

from collections import OrderedDict

import numpy as np
from ignite.engine import Engine

# State restored from a "checkpoint": the engine had already run 8 iterations.
init_state = OrderedDict([('seed', 12), ('epoch_length', 8),
                          ('max_epochs', 1), ('iteration', 8)])

def dataset():
    yield np.random.random()

e = Engine(lambda engine, batch: batch)
e.load_state_dict(init_state)
print(f'e.state before run: {e.state}')
# run() with a new epoch_length discards the loaded iteration (see output below).
e.run(dataset(), epoch_length=1)
print(f'e.state after run: {e.state}')

Output:

$ python test.py
e.state before run: State:
	iteration: 8
	epoch: 1
	epoch_length: 8
	max_epochs: 1
	output: <class 'NoneType'>
	batch: <class 'NoneType'>
	metrics: <class 'dict'>
	dataloader: <class 'NoneType'>
	seed: 12

e.state after run: State:
	iteration: 1
	epoch: 1
	epoch_length: 1
	max_epochs: 1
	output: 0.15416284237967237
	batch: 0.15416284237967237
	metrics: <class 'dict'>
	dataloader: <class 'generator'>
	seed: 12

You can see that the iteration is indeed reset.


Top Results From Across the Web

How to get the global_step when restoring checkpoints in ...
General pattern is to have a global_step variable to keep track of steps global_step = tf.Variable(0, name='global_step', trainable=False) ...

Global step loaded by "load_from_check_point" is wrong
global step loaded by "load_from_check_point" is "0" whereas "torch.load" gives the correct global step... I'm not quite sure if it is a bug ...

Model Checkpointing — DeepSpeed 0.8.0 documentation
Checkpoint tag used as a unique identifier for the checkpoint; global step is used if not provided. Tag name must be the same...

tf.train.Checkpoint | TensorFlow v2.11.0
It can be more robust to changes in the Python program, and helps to support restore-on-create for variables. Checkpoint objects have ...

ignite.handlers — PyTorch-Ignite v0.4.1 Documentation
Checkpoint handler can be used to periodically save and load objects which have ... Default is None, global_step based on attached engine.
