Enabling engine to run single epochs


🚀 Feature

Problem

I am using multiple engines in a nested way: a child engine is attached to an event of the main engine (e.g. Events.EPOCH_COMPLETED) and should run exactly one epoch each time that event fires. One solution would be to run the child engine with engine.run(max_epochs=1), but then the engine fires its setup and teardown events, such as Events.STARTED and Events.COMPLETED, on every call to engine.run(max_epochs=1), even though those events are, as far as I understand, meant to fire only once. Since my child engine must set up and tear down things, I could attach those event handlers to the main engine instead, but the handlers I want to attach do not know that a main engine exists; they shouldn't have any access to the main engine.
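
For concreteness, a minimal sketch of the nesting pattern described above (the process functions and data are placeholders; it only illustrates the repeated setup/teardown events):

from ignite.engine import Engine, Events

parent_engine = Engine(lambda engine, batch: None)
child_engine = Engine(lambda engine, batch: None)

# These are meant as one-time handlers, but end up firing on every parent epoch.
child_engine.add_event_handler(Events.STARTED, lambda e: print("child setup"))
child_engine.add_event_handler(Events.COMPLETED, lambda e: print("child teardown"))

@parent_engine.on(Events.EPOCH_COMPLETED)
def run_child(_):
    # Runs a single child epoch, but also re-fires the child's STARTED/COMPLETED
    child_engine.run([0], max_epochs=1)

parent_engine.run([0], max_epochs=3)  # prints "child setup"/"child teardown" three times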

Solution

I need some functionality so that the engine can do the following (this is just an example of one possible, if clumsy, way to implement it):

engine.run_epoch(max_epochs=3)  # runs setup and first epoch, fires events from `STARTED` to `EPOCH_COMPLETED`
engine.run_epoch(max_epochs=3)  # runs second epoch, fires events from `EPOCH_STARTED` to `EPOCH_COMPLETED`
engine.run_epoch(max_epochs=3)  # runs last epoch and teardown, fires events from `EPOCH_STARTED` to `COMPLETED`

Instead of calling a function, one could create an iterable object from engine.run and get the same behavior in a nicer way:

epoch_iterator = iterable_engine.run(max_epochs=3)
next(epoch_iterator)  # runs setup and first epoch, fires events from `STARTED` to `EPOCH_COMPLETED`
next(epoch_iterator)  # runs second epoch, fires events from `EPOCH_STARTED` to `EPOCH_COMPLETED`
next(epoch_iterator)  # runs last epoch and teardown, fires events from `EPOCH_STARTED` to `COMPLETED`

Or one can use loops:

iterable_engine = IterableEngine(lambda x, y: 0.)
iterable_engine.add_event_handler(Events.STARTED, lambda x: print("started"))
iterable_engine.add_event_handler(Events.EPOCH_STARTED, lambda x: print("epoch started"))
iterable_engine.add_event_handler(Events.EPOCH_COMPLETED, lambda x: print("epoch completed"))
iterable_engine.add_event_handler(Events.COMPLETED, lambda x: print("completed"))

epoch_iterator = iterable_engine.run([1], max_epochs=3)
for state in epoch_iterator:
    print("This is outside engine.run")

The output is:

started
epoch started
epoch completed
This is outside engine.run
epoch started
epoch completed
This is outside engine.run
epoch started
epoch completed
This is outside engine.run
completed

I added the code at the bottom, where I subclass Engine and override the _internal_run method with a copy of the original method plus one added line: the yield statement. You can execute it and it reproduces the example output. To switch between the current behavior and this one, one could put the yield inside an if statement and pass an additional argument to engine.run, e.g. engine.run(max_epochs=3, return_generator=True), or set a flag on the engine to enable this functionality.

What do you think?

Code:

import time
from typing import Iterator

from ignite._utils import _to_hours_mins_secs
from ignite.engine import Engine
from ignite.engine import Events
from ignite.engine import State


class IterableEngine(Engine):
    def _internal_run(self) -> Iterator[State]:
        # Copy of Engine._internal_run with a single added line (the `yield`
        # after each epoch); because of the yield this method is a generator,
        # so the return annotation is adjusted accordingly.
        self.should_terminate = self.should_terminate_single_epoch = False
        self._init_timers(self.state)
        try:
            start_time = time.time()
            self._fire_event(Events.STARTED)
            while self.state.epoch < self.state.max_epochs and not self.should_terminate:
                self.state.epoch += 1
                self._fire_event(Events.EPOCH_STARTED)

                if self._dataloader_iter is None:
                    self._setup_engine()

                time_taken = self._run_once_on_dataset()
                # time is available for handlers but must be updated after fire
                self.state.times[Events.EPOCH_COMPLETED.name] = time_taken
                handlers_start_time = time.time()
                if self.should_terminate:
                    self._fire_event(Events.TERMINATE)
                else:
                    self._fire_event(Events.EPOCH_COMPLETED)
                time_taken += time.time() - handlers_start_time
                # update time wrt handlers
                self.state.times[Events.EPOCH_COMPLETED.name] = time_taken
                hours, mins, secs = _to_hours_mins_secs(time_taken)
                self.logger.info(
                    "Epoch[%s] Complete. Time taken: %02d:%02d:%02d" % (self.state.epoch, hours, mins, secs)
                )
                if self.should_terminate:
                    break
                # Added line: hand control back to the caller after each epoch
                yield self.state

            time_taken = time.time() - start_time
            # time is available for handlers but must be updated after fire
            self.state.times[Events.COMPLETED.name] = time_taken
            handlers_start_time = time.time()
            self._fire_event(Events.COMPLETED)
            time_taken += time.time() - handlers_start_time
            # update time wrt handlers
            self.state.times[Events.COMPLETED.name] = time_taken
            hours, mins, secs = _to_hours_mins_secs(time_taken)
            self.logger.info("Engine run complete. Time taken: %02d:%02d:%02d" % (hours, mins, secs))

        except BaseException as e:
            self._dataloader_iter = None
            self.logger.error("Engine run is terminating due to exception: %s.", str(e))
            self._handle_exception(e)

        self._dataloader_iter = None
        return self.state


if __name__ == '__main__':
    iterable_engine = IterableEngine(lambda x, y: 0.)
    iterable_engine.add_event_handler(Events.STARTED, lambda x: print("started"))
    iterable_engine.add_event_handler(Events.EPOCH_STARTED, lambda x: print("epoch started"))
    iterable_engine.add_event_handler(Events.EPOCH_COMPLETED, lambda x: print("epoch completed"))
    iterable_engine.add_event_handler(Events.COMPLETED, lambda x: print("completed"))

    epoch_iterator = iterable_engine.run([1], max_epochs=3)
    for state in epoch_iterator:
        print("This is outside engine.run")

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
alxlampe commented, Oct 16, 2020

@vfdev-5 I did some experiments with the example from my last post and there are some issues. First, the yield cannot simply be guarded by an if statement:

if return_generator:
    yield self.state

This does not work as expected: even if return_generator == False, the function still returns a generator (see the sketch below).
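
This is plain Python behavior rather than anything ignite-specific: a function whose body contains yield is always a generator function, even if the yield is behind a condition. A common workaround (just a sketch, not ignite code) is to keep the generator in a separate helper and dispatch in a plain wrapper:

def maybe_yield(return_generator):
    if return_generator:
        yield "state"

print(maybe_yield(False))  # still a <generator object>, not None

# Workaround sketch: the plain function decides whether to hand out the
# generator or to drain it itself.
def _run_as_generator():
    yield "state"

def run(return_generator=False):
    if return_generator:
        return _run_as_generator()
    for state in _run_as_generator():
        pass
    return state

print(run(False))        # "state"
print(next(run(True)))   # "state", taken lazily from the generator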

Second, if I have two doubly nested engines (two instances of child_child_engine in the example from above), the STARTED event is not in sync with the parent engines. This means that the first child_child_engine runs an epoch before the second child_child_engine fires STARTED.

EVENT: STARTED
	EVENT: STARTED
	EVENT: EPOCH_STARTED
		 EVENT: STARTED
		 EVENT: EPOCH_STARTED <- not intended*
		 EVENT: EPOCH_COMPLETED
		 EVENT: STARTED
		 EVENT: EPOCH_STARTED
		 EVENT: EPOCH_COMPLETED
	EVENT: EPOCH_COMPLETED
EVENT: EPOCH_COMPLETED
	EVENT: EPOCH_STARTED
		 EVENT: EPOCH_STARTED
		 EVENT: EPOCH_COMPLETED
		 EVENT: EPOCH_STARTED
		 EVENT: EPOCH_COMPLETED
	EVENT: EPOCH_COMPLETED
EVENT: EPOCH_COMPLETED
		 EVENT: COMPLETED
		 EVENT: COMPLETED
	EVENT: COMPLETED
EVENT: COMPLETED

* The STARTED event of the 2nd child_child_engine should be fired before the 1st child_child_engine runs an epoch.

I’ve created a gist here with an engine that implements the following methods:

  • setup_run does the setup and fires Events.STARTED
  • run_epoch runs exactly one epoch and fires everything between Events.EPOCH_STARTED and Events.EPOCH_COMPLETED. If max_epochs is reached, it also runs finish_run.
  • finish_run runs the teardown part and fires Events.COMPLETED.

This engine is attachable or nestable with attach_to_engine. It adds event handlers to events of the parent engine in order to fire the one-time events (STARTED, COMPLETED, …) in sync with the parent engine. In this example, the run_epoch method is attached to the parent's EPOCH_COMPLETED. This makes it possible to run the nested engines without any changes to the parent engine. (A rough sketch of this interface follows after the output below.)

The output is the following:

EVENT: STARTED (parent_engine)
	EVENT: STARTED (nested_engine)
		EVENT: STARTED (doubly_nested_engine1)
		EVENT: STARTED (doubly_nested_engine2)
EVENT: EPOCH_STARTED (parent_engine)
EVENT: EPOCH_COMPLETED (parent_engine)
	EVENT: EPOCH_STARTED (nested_engine)
	EVENT: EPOCH_COMPLETED (nested_engine)
		EVENT: EPOCH_STARTED (doubly_nested_engine1)
		EVENT: EPOCH_COMPLETED (doubly_nested_engine1)
		EVENT: EPOCH_STARTED (doubly_nested_engine2)
		EVENT: EPOCH_COMPLETED (doubly_nested_engine2)
EVENT: EPOCH_STARTED (parent_engine)
EVENT: EPOCH_COMPLETED (parent_engine)
	EVENT: EPOCH_STARTED (nested_engine)
	EVENT: EPOCH_COMPLETED (nested_engine)
		EVENT: EPOCH_STARTED (doubly_nested_engine1)
		EVENT: EPOCH_COMPLETED (doubly_nested_engine1)
		EVENT: EPOCH_STARTED (doubly_nested_engine2)
		EVENT: EPOCH_COMPLETED (doubly_nested_engine2)
EVENT: COMPLETED (parent_engine)
	EVENT: COMPLETED (nested_engine)
		EVENT: COMPLETED (doubly_nested_engine1)
		EVENT: COMPLETED (doubly_nested_engine2)
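
To make the interface above more concrete, here is a rough sketch of what it could look like (the method bodies and the data/max_epochs parameters of attach_to_engine are my assumptions; the actual gist implementation may differ):

from ignite.engine import Engine, Events


class NestableEngine(Engine):
    def setup_run(self, data, max_epochs):
        # Remember the run configuration and fire STARTED exactly once.
        self._data = data
        self._run_max_epochs = max_epochs
        self._finished = False
        self.state.max_epochs = max_epochs
        self._fire_event(Events.STARTED)

    def run_epoch(self):
        # Run exactly one epoch: everything between EPOCH_STARTED and EPOCH_COMPLETED.
        self.state.epoch += 1
        self._fire_event(Events.EPOCH_STARTED)
        # ... iterate self._data once, firing ITERATION_* events and calling
        # the process function (omitted in this sketch) ...
        self._fire_event(Events.EPOCH_COMPLETED)
        if self.state.epoch >= self._run_max_epochs:
            self.finish_run()

    def finish_run(self):
        # Tear down and fire COMPLETED (guarded so it fires only once per run).
        if not self._finished:
            self._finished = True
            self._fire_event(Events.COMPLETED)

    def attach_to_engine(self, parent_engine, data, max_epochs):
        # Keep the one-time events in sync with the parent engine and run
        # one child epoch per parent epoch.
        parent_engine.add_event_handler(Events.STARTED, lambda _: self.setup_run(data, max_epochs))
        parent_engine.add_event_handler(Events.EPOCH_COMPLETED, lambda _: self.run_epoch())
        parent_engine.add_event_handler(Events.COMPLETED, lambda _: self.finish_run())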

Another case would be if the parent engine only serves to drive the runs of its child engines. Then one could create the parent engine with a process function that runs epochs of the child engines:

def _process_child_engines(engine, child_engines):
    for child_engine in child_engines:
        child_engine.run_epoch()

This could also solve the problem in #1384 (Step 2 in the discussion). The advantage would be that child_engine can also run standalone, i.e. it can run training via child_engine.run(...) without any changes to metric handlers, loggers, etc. In the use case of #1384, the only thing that must be added is some kind of MetricSummaryHandler that is attached to the parent engine and summarizes the metrics of all child engines.

What could it look like? I am not very familiar with k-fold cross-validation, but it seems to be a good example. The code below doesn't work as-is but should give some intuition about the idea:

child_engines = []
for k in range(num_k):
    k_fold_data_loader = get_k_fold_data(k, num_k)
    # setup_engine is a function that sets up the training process for one engine with all its metrics, loggers, etc.
    engine = setup_engine(data=k_fold_data_loader)  # takes arguments like engine.run, stores k_fold_data_loader
    child_engines.append(engine)

serving_engine = Engine(_process_child_engines)  # process function from above
# attach children to keep one-time events synchronized with the serving engine
for child_engine in child_engines:
    child_engine.attach_to_parent_engine(serving_engine)

# add metrics summarizer
metrics_summarizer = MetricsSummarizer(child_engines, ...)  # add children to the summarizer
metrics_summarizer.attach(serving_engine)  # adds an EPOCH_COMPLETED handler that summarizes metrics after each epoch


serving_engine.run(data=child_engines, max_epochs=100)

Some more info about my use case: I am using this in the field of RL, where I have multiple tasks (child engines). Each task engine runs the agent in a specific environment. In evaluation, for example, I run 20 different tasks (20 specific environment setups) and compute summarized metrics based on the results of the 20 runs. Furthermore, I have added task groups, because some tasks fall into the same category, e.g. 5 different tasks in the group walking, 10 different tasks in running, etc.

I hope that gives some insight and shows that it is a useful feature 😃 One could even think of extending this to let the child engines run in parallel, which is what I am planning for my experiments in the future. If you're interested, we could discuss further what I've already implemented (e.g. groups and a group metrics handler).

Another use case that fits nicely into this framework is running an experiment with multiple seeds at the same time and computing a metrics summary (mean, min, max) on the fly; a sketch of such a summarizer follows below.
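
As an illustration of that summarizer idea, a minimal sketch (MetricsSummarizer here is hypothetical, mirroring the pseudo-code above, not an existing ignite handler):

import statistics

from ignite.engine import Events


class MetricsSummarizer:
    def __init__(self, child_engines, metric_name):
        self.child_engines = child_engines
        self.metric_name = metric_name

    def attach(self, parent_engine):
        # Summarize after every parent epoch, i.e. after all child epochs have run.
        parent_engine.add_event_handler(Events.EPOCH_COMPLETED, self)

    def __call__(self, parent_engine):
        values = [child.state.metrics[self.metric_name] for child in self.child_engines]
        parent_engine.state.metrics[self.metric_name + "/mean"] = statistics.mean(values)
        parent_engine.state.metrics[self.metric_name + "/min"] = min(values)
        parent_engine.state.metrics[self.metric_name + "/max"] = max(values)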

1 reaction
vfdev-5 commented, Oct 6, 2020

@alxlampe interesting idea, thanks!

If I understand correctly, what you would like to achieve can be done with some event filtering:

from ignite.engine import Engine, Events

engine = Engine(lambda e, b: None)

def once_at_start(engine, _):
    return engine.state.epoch == 0

def once_at_end(engine, _):
    return engine.state.epoch == 10

engine.add_event_handler(Events.STARTED(once_at_start), lambda x: print("started"))
engine.add_event_handler(Events.EPOCH_STARTED, lambda x: print("{} epoch started".format(x.state.epoch)))
engine.add_event_handler(Events.EPOCH_COMPLETED, lambda x: print("{} epoch completed".format(x.state.epoch)))
engine.add_event_handler(Events.COMPLETED(once_at_end), lambda x: print("completed"))

engine.run([0, 1, 2], max_epochs=3)
print("Do something else")
engine.run([0, 1, 2], max_epochs=6)
print("Do something else")
engine.run([0, 1, 2], max_epochs=10)

gives

started
1 epoch started
1 epoch completed
2 epoch started
2 epoch completed
3 epoch started
3 epoch completed
Do something else
4 epoch started
4 epoch completed
5 epoch started
5 epoch completed
6 epoch started
6 epoch completed
Do something else
7 epoch started
7 epoch completed
8 epoch started
8 epoch completed
9 epoch started
9 epoch completed
10 epoch started
10 epoch completed
completed

Can it be generalized to your use case, or is it too ugly and specific? What do you think?

> Since my child engine must set up and tear down things, I could attach those event handlers to the main engine instead, but the handlers I want to attach do not know that a main engine exists; they shouldn't have any access to the main engine.

Maybe this requirement is not satisfied by the code above.
