Training State Centered Framework vs. Engine Centered Architecture
Hi!
I have been thinking for some time about the current architecture of Ignite and how to improve the workflow, both during application development (writing training scripts) and during feature development. When writing training scripts I could not code the individual training I wanted, and when trying to code the feature for my training script I ended up writing infrastructure/architecture code instead of implementing the feature. There were some kind of restrictions I couldn’t identify at first…
Nevertheless, after a while I found 2 twists, actually nothing big… so here I want to propose an under-the-hood framework to integrate into Ignite that solves all my problems and many open issues! Nice, right?
With this framework integrated into Ignite you get an extremely nice overview during debugging, turn Ignite into a rapid feature dev tool, can handle far more complex (individual) use cases while achieving a higher degree of automation at the same time, and gain quite a few new features and many more possibilities for syntactic sugar.
But now comes the BUT: as I tried to fix it, I unfortunately had to realize it won’t work without major revisions. So I went a long way to really provide proof and facts - something you can play with to form your own opinion - before daring to suggest a major revision…
So, if you’re interested in an up-and-running “what-if-when” Ignite version with the 2 twists below untwisted, please have a look at the repository and the documentation and leave me some feedback - I’d really like to know your opinion.
In case enough of you like it and could imagine integrating the framework into Ignite, I could open a pull request with the code on an experimental branch and we see how it goes from there. (Note: I just pushed it to another repository because, as far as I know, you cannot open a pull request against a new branch - which this definitely needs.)
Everything else you need to know you will find in the Ignite Framework repo and the docs. For bugs & questions, let me know, thx!
So, set up your first coffee & enjoy playing!
Teaser from the documentation
Two issues
I am a fan of Ignite and that’s why I’m trying to contribute, but I discovered 2 shortcomings in the architecture and the implementation that caused me quite some restrictions and made me code infrastructure instead of programming new features (which is what I actually wanted to do). The issues are:
- Engine centered architecture: In current Ignite the `Engine` is the architectural center, with the training `state` as an attribute. The training `state` attribute is a transient object that is only instantiated when the `Engine` is in `run` mode and vanishes afterwards. Also, the `state` holds only a selective fraction of all variables and parameters that make up the real training state. So `Engine` is a kind of static object and `state` is transient. This does not represent the reality of the training process. In reality, the training starts with an initial state holding all variables and parameters, including e.g. model variables, hyperparameters etc., which are then modified while the state goes through different transitions. The main transitions of the `state` are `Engine`s (normally more than one). So the state should be the architectural center holding ALL variables, parameters, values, transitions etc., and the Engine is (just) the main transition of the state. This small twist causes quite some complications for features and APIs, which are listed below.
- Event is broken in many pieces: Currently an `Event` is an `Enum` that has to be explicitly `fire_event`ed and implicitly `_fire_event`ed so the `event_handlers` handle further callbacks. The `Event` is always fired after some other training value has changed, e.g. after the model output was updated, `ITERATION_COMPLETED` is fired. Also, if you want to fire a non-standard event, you first have to create it, register it at each `Engine` that is supposed to use it, and then the firing has to be implemented… But in reality an Event is nothing more than a value change of a training state variable that triggers callbacks. So all these pieces above can be put together by implementing a state variable as a descriptor.
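To make the descriptor idea from the second point concrete, here is a minimal sketch in plain Python. It is not the code from the framework repository and it does not use Ignite; the names `StateVariable`, `State` and `attach` are made up for illustration. It only shows how assigning to a state variable can act as the event that triggers the attached callbacks.

```python
class StateVariable:
    """Hypothetical descriptor: assigning a value runs the attached callbacks."""

    def __set_name__(self, owner, name):
        # Remember the public name and where the value is stored per instance.
        self.name = name
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return getattr(instance, self.private_name, None)

    def __set__(self, instance, value):
        setattr(instance, self.private_name, value)
        # The value change itself is the "event": call everything attached to it.
        for callback in instance._callbacks.get(self.name, []):
            callback(instance)


class State:
    """Hypothetical training state whose variables double as events."""

    n_iterations = StateVariable()
    output = StateVariable()

    def __init__(self):
        # Must exist before the first descriptor assignment below.
        self._callbacks = {}
        self.n_iterations = 0

    def attach(self, variable_name, callback):
        self._callbacks.setdefault(variable_name, []).append(callback)


state = State()
log = []
state.attach("n_iterations", lambda s: log.append(s.n_iterations))

for _ in range(3):
    state.n_iterations += 1  # read via __get__, write via __set__, callbacks run

print(log)
```

In this sketch there is no separate event object to create, register or fire: a callback is attached to a variable name, and every assignment to that variable runs it.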
Improvements from an underlying framework
You will experience the improvements given by the framework when working on all 3 levels: application implementation, feature development and framework development. The separation of these working areas is already the first improvement. Try out the benefits in detail & hands-on for the first two levels in the Quickunderstanding Application and Quickunderstanding Feature Dev.
Before you go through the theoretically described enhancements, these few no-comment teasers of the training `state` in the debugger will give you nice insights into what’s ahead. It shows the Ignite example mnist_with_tensorboard.py transferred to the framework architecture just before the engines are started:
And here the engine `state.engines.trainer` unfolded:
Or setting up all the below Tensorboard charts with these two simple commands:

```python
# Automatically identify and generate metric charts comparing the different engines
EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)
# Automatically generate for each engine a summary of all metric charts
EnginesMetricsCharts(x_axes_refs=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)
```
By the way, if you had set up 10x more metrics and some more engines, these two commands would not change and would still provide all comparative and single metric charts of all engines.
Soooo, if you’re interested, then grab a coffee and press >>>PLAY<<<!
Issue Analytics
- Created 4 years ago
- Comments:17 (3 by maintainers)
Top GitHub Comments
I really think the best way is to move forward and start implementing parts of your proposals. I’m sure we’ll encounter some problems during that, but I think this is the only way to really get an overview here. Regarding the namespace: I think we should carefully think about what we’re adding. So I would not like to add everything just because we can. But I agree that some stuff would be beneficial. But I think these are too many details to discuss them all without an implementation proposal.
And this is also why I suggested the implementation order above: we start with the easy things that would most likely be a good idea and move on to the things we need to discuss based on these implementation details.
Hi,
First of all: thank you very much for your work, this looks awesome!
I had a look at your code and what you did there is impressive. I have a few things to comment on:
1.) You are right that using a transient state has some restrictions, and I often had to implement some workarounds as well. On the other hand, I really don’t like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.
2.) Regarding the events: You’re right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don’t need them. In your training loop, you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End…) which are already predefined. I agree, that one could probably add some more, but for 99% of the cases these are enough. And also this follows Python’s core philosophy that explicit is better than implicit.
3.) The engine is supposed to be trainer and validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface that provides a tested way to do the looping.
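For comparison with point 2, here is a minimal sketch of the explicit style: a small, fixed set of predefined events that the loop fires by name. It is plain Python, not Ignite's actual implementation, and `LoopEvent`, `MiniEngine`, `on` and `fire` are hypothetical names chosen only for this illustration.

```python
from collections import defaultdict
from enum import Enum, auto


class LoopEvent(Enum):
    """Hypothetical fixed set of predefined loop events."""
    STARTED = auto()
    EPOCH_STARTED = auto()
    ITERATION_COMPLETED = auto()
    EPOCH_COMPLETED = auto()
    COMPLETED = auto()


class MiniEngine:
    """Hypothetical loop that fires each event explicitly at a known point."""

    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.handlers = defaultdict(list)
        self.iteration = 0
        self.output = None

    def on(self, event, handler):
        self.handlers[event].append(handler)

    def fire(self, event):
        for handler in self.handlers[event]:
            handler(self)

    def run(self, data, max_epochs=1):
        self.fire(LoopEvent.STARTED)
        for _ in range(max_epochs):
            self.fire(LoopEvent.EPOCH_STARTED)
            for batch in data:
                self.output = self.process_fn(batch)
                self.iteration += 1
                # Explicit: the loop decides where this event happens.
                self.fire(LoopEvent.ITERATION_COMPLETED)
            self.fire(LoopEvent.EPOCH_COMPLETED)
        self.fire(LoopEvent.COMPLETED)


seen = []
engine = MiniEngine(lambda batch: batch * 2)
engine.on(LoopEvent.ITERATION_COMPLETED, lambda e: seen.append((e.iteration, e.output)))
engine.run([1, 2, 3])
print(seen)
```

The trade-off is visible here: every firing point is easy to find in `run`, but an event that is not in the predefined set has to be added to the enum and fired by hand.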
In general I really like your points (maybe not each part of the implementation, but that’s okay I guess). So what I would ultimately propose:
We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:
I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won’t be global
II) Once we all agree on that state, we will have a look at where we want to add additional hooks for events and whether we can somehow simplify the process of registering them.
III) After this is done, we will most likely have to revisit your framework, look at what’s still missing from there, and decide whether it makes sense to actually include it here.
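As a rough idea of what step I) could look like, here is a minimal sketch of a state that holds more entries and survives between runs, but is still owned by one loop object rather than being global. This is only an assumption about the direction, not an agreed design; `RicherState`, `Loop` and the `hparams` field are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class RicherState:
    """Hypothetical state with more entries than a purely transient one."""
    epoch: int = 0
    iteration: int = 0
    output: Any = None
    metrics: Dict[str, float] = field(default_factory=dict)
    hparams: Dict[str, Any] = field(default_factory=dict)


class Loop:
    """Hypothetical loop that owns its state and keeps it across run() calls."""

    def __init__(self, step_fn, state=None):
        self.step_fn = step_fn
        # The state exists before and after run(), so it can be inspected any time.
        self.state = state if state is not None else RicherState()

    def run(self, data, max_epochs=1):
        for _ in range(max_epochs):
            self.state.epoch += 1
            for batch in data:
                self.state.iteration += 1
                self.state.output = self.step_fn(batch, self.state)
        return self.state


loop = Loop(
    lambda batch, st: batch * st.hparams["scale"],
    RicherState(hparams={"scale": 10}),
)
loop.run([1, 2])
loop.run([3])  # counters continue instead of starting from a fresh state
print(loop.state.epoch, loop.state.iteration, loop.state.output)
```

Each `Loop` has its own `RicherState`, so a trainer and a validator would not share variables unless the same state object is passed to both on purpose.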
For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can. What do you think @DrStoop ?