Training State Centered Framework vs. Engine Centered Architecture

Hi!

I have been thinking for some time about the current architecture of Ignite and how to improve the workflow, both during application development (writing training scripts) and during feature development. When writing training scripts, I could not code the individual training I wanted, and when trying to code the feature for my training script, I ended up writing infrastructure/architecture code instead of implementing the feature. There were some kind of restrictions I couldn’t identify at first…

Nevertheless, after a while I found 2 twists, actually nothing big… So here I want to propose an under-the-hood framework to integrate into Ignite that solved all my problems and many open issues! Nice, right?

With this framework integrated into Ignite, you get an extremely nice overview during debugging, turn Ignite into a rapid feature dev tool, can handle far more complex (individual) use cases while achieving a higher degree of automation at the same time, and gain quite a few new features and many more possibilities for syntactic sugar.

But now comes the… BUT: as I tried to fix it, I unfortunately had to realize it won’t work without major revisions. So I went a long way to really provide proof and facts, something you can play with to form your own opinion, before daring to suggest a major revision…

So, if you’re interested in an up-and-running “what-if-when” Ignite version with the 2 twists below untwisted, please have a look at the repository and the documentation and leave me some feedback. I’d really like to know your opinion.

In case enough of you like it and could imagine integrating the framework into Ignite, I could open a pull request with the code on an experimental branch and we see how it goes from there. (Note: I just pushed it to another repository because, as far as I know, you cannot open a pull request for a new branch, which this definitely needs.)

Everything else you need to know can be found in the Ignite Framework repo and the docs. For bugs & questions, let me know, thx!

So, set up your first coffee & enjoy playing!

Teaser from the documentation

Two issues

I am a fan of Ignite and that’s why I’m trying to contribute, but I discovered 2 shortcomings in the architecture and the implementation that caused me quite some restrictions and had me coding infrastructure instead of programming new features (which is what I actually wanted to do). The issues are:

  • Engine-centered architecture: In current Ignite, the Engine is the architectural center with the training state as an attribute. The training state attribute is a transient object that is only instantiated when the Engine is in run mode and vanishes afterwards. Also, the state holds only a selective fraction of all the variables and parameters that make up the real training state. So the Engine is a kind of static object and the state is transient. This does not reflect the reality of the training process. In reality, training starts with an initial state holding all variables and parameters, including e.g. model variables, hyperparameters etc., which are then modified as the state goes through different transitions. The main transitions of the state are Engines (normally more than one). So the state should be the architectural center holding ALL variables, parameters, values, transitions etc., and the Engine is (just) the main transition of the state. This small twist causes quite some complications for features and APIs, which are listed below.
  • Event is broken into many pieces: Currently an Event is an Enum that has to be fired explicitly with fire_event, and implicitly with _fire_event, so that the event handlers handle further callbacks. The Event is always fired after some other training value has changed, e.g. once the model output has been updated, ITERATION_COMPLETED is fired. Also, if you want to fire a non-standard event, you first have to create it, register it with each Engine that is supposed to use it, and then the firing has to be implemented… But in reality an Event is nothing more than a value change of a training state variable that triggers callbacks. So all the pieces above can be put together by implementing a state variable as a descriptor.
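
To make the two points above concrete, here is a minimal sketch in plain Python of a state variable implemented as a descriptor, with a toy engine acting as a transition on a persistent state. It is not code from the proposed framework or from Ignite: the names StateVariable, TrainingState, attach and run_trainer are hypothetical and chosen only for illustration.

```python
class StateVariable:
    """Descriptor: assigning a value stores it, then calls the attached callbacks."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class-level access returns the descriptor itself
        return instance.__dict__.get(self.name)  # None until first assignment

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value
        # The assignment itself is the "event": run every callback
        # attached to this variable on this state instance.
        for callback in instance._callbacks.get(self.name, []):
            callback(instance)


class TrainingState:
    """Hypothetical state object (not Ignite's State) that outlives any engine run."""

    iteration = StateVariable()
    loss = StateVariable()

    def __init__(self):
        self._callbacks = {}  # variable name -> list of callbacks
        self.history = []

    def attach(self, variable_name, callback):
        self._callbacks.setdefault(variable_name, []).append(callback)


def run_trainer(state, batches):
    """Toy 'engine': a transition that only reads and writes the state."""
    for i, batch in enumerate(batches, start=1):
        state.loss = sum(batch) / len(batch)  # stand-in for a real training step
        state.iteration = i  # plays the role of firing ITERATION_COMPLETED


state = TrainingState()
# No event enum, no registration on the engine: the callback hangs on the variable.
state.attach("iteration", lambda s: s.history.append((s.iteration, s.loss)))
run_trainer(state, [[1.0, 3.0], [2.0, 2.0, 5.0]])
print(state.history)  # [(1, 2.0), (2, 3.0)]
```

In this sketch the state still exists after run_trainer returns, and a second engine could be run on the same object. Whether every value change should be able to trigger callbacks is a design trade-off and not something the sketch settles.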

Improvements from an underlying framework

You will experience the improvements provided by the framework when working on all 3 levels: application implementation, feature development and framework development. The separation of these working areas is already the first improvement. Try out the benefits in detail and hands-on for the first two levels in the Quickunderstanding Application and Quickunderstanding Feature Dev.

Before you go through the theoretically described enhancements, these few no-comment teasers of the training state in the debugger will give you a good idea of what’s ahead. The first shows the Ignite example mnist_with_tensorboard.py transferred to the framework architecture just before the engines are started:

teaser_state_in_debugger

And here the engine state.engines.trainer unfolded:

teaser_engine_in_debugger

Or setting up all the Tensorboard charts below with these two simple commands:

# Automatically identify and generate metric charts comparing the different engines
EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)
# Automatically generate for each engine a summary of all metric charts
EnginesMetricsCharts(x_axes_refs=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)

By the way, if you had set up 10x more metrics and some more engines, these two commands would not need to change to provide all comparative and single metric charts of all engines.

teaser_tensorboard

Soooo, if you’re interested, then grab a coffee and press >>>PLAY<<<!

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 17 (3 by maintainers)

Top GitHub Comments

2 reactions
justusschock commented, Feb 27, 2020

I really think the best way is to move forward and start implementing parts of your proposals. I’m sure we’ll encounter some problems during that, but I think this is the only way to really get an overview here. Regarding the namespace: I think we should carefully think about what we’re adding. So I would not like to add everything just because we can. But I agree that some stuff would be beneficial. But I think these are too many details to discuss them all without an implementation proposal.

And this is also why I suggested the implementation order above: we start with the easy things that would most likely be a good idea and move on to the things we need to discuss based on these implementation details.

2 reactions
justusschock commented, Feb 26, 2020

Hi,

First of all: thank you very much for your work, this looks awesome!

I had a look at your code and what you did there is impressive. I have a few things to comment on:

1.) You are right that using a transient state has some restrictions, and I often had to implement some workarounds as well, but on the other hand, I really don’t like the idea of a state that contains virtually everything. That way, your scope would be way too global and it would become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.

2.) Regarding the events: You’re right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don’t need them. In your training loop, you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End…) which are already predefined. I agree that one could probably add some more, but for 99% of the cases these are enough. This also follows Python’s core philosophy that explicit is better than implicit.

3.) The engine is supposed to be trainer and validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface that gives you a tested way to do the looping.

In general I really like your points (maybe not each part of the implementation, but that’s okay I guess). So what I would ultimately propose:

We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:

I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won’t be global

II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.

III) After this is done, we will most likely have to revisit your framework and look at what’s still missing from there and whether it makes sense to actually include it here.

For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can. What do you think @DrStoop ?
