Chained Schedulers
Currently, if one wants to have a warmup or cooldown period, one has to write a custom scheduler for that. Instead, we could have an abstraction for supporting multiple scheduling periods. This could be done via 1) a `ChainedScheduler`-like approach, 2) applying multiple schedulers simultaneously, or 3) making scheduling periods a first-class feature of the trainer.
I am leaning against option 3 because it would overly complicate the trainer for the generic use case where one only has one training period. While option 2 would be syntactically simpler, it would introduce challenges with SSR. As such, I am leaning towards option 1.
For option 1:
- Each `ComposerScheduler` should take an `apply_ssr` parameter (which can be set to `False` if being used in warmup) and an `end_time` or `period_length` parameter (or both, but only allow one to be set). A scheduler should not assume a `start_time` – that would be determined implicitly whenever the scheduler is first `__call__`ed.
- We would also need to modify the `__call__` API of the `ComposerScheduler` such that a scheduler returns `None` when it is finished (i.e. its `end_time` or `period_length` has elapsed). `None` would signal to a `ChainedScheduler` that the scheduler is done and is no longer managing the learning rate. If the trainer received `None`, it would interpret that as a signal not to modify the learning rate (which would be equivalent to returning the last returned value). For schedulers (e.g. cooldown) that should run until the end of training, the `end_time` parameter could be `1dur`, in which case the scheduler would never return `None`.
- We can have a `ChainedScheduler(schedulers: List[Scheduler] | OrderedDict[str, Scheduler])`-like API. (A dict could be used to give each period a name – e.g. “warmup”.) Whenever the `ChainedScheduler` is `__call__`ed, it would `__call__` the 0th scheduler until it returns `None`; then it would move on to the 1st scheduler until that returns `None`, etc. After all schedulers are exhausted, it would return `None`, which would signal to the trainer to leave the learning rate the same.
- The `ChainedScheduler` would make the currently active scheduler available as an object. Assuming only one `ChainedScheduler` is being used, algorithms could inspect `state.schedulers[0].active_scheduler`, `state.schedulers[0].active_scheduler_idx`, or `state.schedulers[0].active_scheduler_name` to determine which learning rate period we are in.
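The chaining behavior described above could be sketched roughly as follows. This is a minimal standalone sketch, not Composer's actual API: the `Scheduler` callable signature (training progress in, LR multiplier or `None` out) and the toy warmup/constant schedulers are hypothetical stand-ins for `ComposerScheduler`.

```python
from typing import Callable, List, Optional

# Hypothetical scheduler signature: takes elapsed training progress (as a
# fraction of total duration) and returns an LR multiplier, or None once
# the scheduler's period has elapsed.
Scheduler = Callable[[float], Optional[float]]


class ChainedScheduler:
    """Sketch of option 1: run each scheduler in turn until it returns None."""

    def __init__(self, schedulers: List[Scheduler]) -> None:
        self.schedulers = schedulers
        self.active_scheduler_idx = 0

    @property
    def active_scheduler(self) -> Optional[Scheduler]:
        """The scheduler currently managing the LR, or None if all are done."""
        if self.active_scheduler_idx < len(self.schedulers):
            return self.schedulers[self.active_scheduler_idx]
        return None

    def __call__(self, elapsed: float) -> Optional[float]:
        # Advance past finished schedulers until one yields a multiplier.
        while self.active_scheduler is not None:
            result = self.active_scheduler(elapsed)
            if result is not None:
                return result
            self.active_scheduler_idx += 1
        # All schedulers exhausted: signal "leave the learning rate as-is".
        return None


# Toy periods (hypothetical): 10% linear warmup, then constant until 1.0.
def warmup(t: float) -> Optional[float]:
    return t / 0.1 if t < 0.1 else None


def constant(t: float) -> Optional[float]:
    return 1.0 if t < 1.0 else None


sched = ChainedScheduler([warmup, constant])
```

An algorithm could then check `sched.active_scheduler_idx` (or, per the proposal, `state.schedulers[0].active_scheduler_idx`) to tell which period training is in.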
Thoughts?
CC @dblalock @jbloxham @mosaicml/composer-team-research
_Originally posted by @ravi-mosaicml in https://github.com/mosaicml/composer/issues/632#issuecomment-1056988994_
Issue Analytics
- Created 2 years ago
- Comments: 8 (8 by maintainers)
Generally agree with @hanlint - there’s a lot of complexity going on here that will lead to many edge cases. For warmup schedulers, I generally find it more intuitive to specify the total duration of the composite scheduler rather than the sum of two individual parts. It’s also potentially necessary for some schedulers with warmup to be aware of the length of the warmup period.
We can make it easier to implement new schedulers by offering helper functions to calculate things like “tau” from the scheduler docs I wrote, but I also think it’s already not that hard to write a scheduler: http://localhost:8000/api_reference/composer.optim.scheduler.html#composer.optim.scheduler.CosineAnnealingWithWarmupScheduler
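For illustration, a single warmup-aware scheduler along the lines being discussed (linear warmup into cosine decay, where “tau” is the fraction of the post-warmup period elapsed) can be sketched as a plain multiplier function. Parameter names and defaults here are hypothetical, not Composer's actual `CosineAnnealingWithWarmupScheduler` signature:

```python
import math


def cosine_with_warmup(t: float, t_warmup: float = 0.1, t_max: float = 1.0,
                       alpha_f: float = 0.0) -> float:
    """LR multiplier at progress t: linear warmup, then cosine decay.

    Hypothetical parameters: t_warmup is the warmup fraction, t_max the total
    duration, alpha_f the final multiplier the cosine decays toward.
    """
    if t < t_warmup:
        # Linear warmup from 0 to 1 over the warmup period.
        return t / t_warmup
    # tau: fraction of the post-warmup period that has elapsed, clamped to 1.
    tau = min((t - t_warmup) / (t_max - t_warmup), 1.0)
    return alpha_f + (1 - alpha_f) * 0.5 * (1 + math.cos(math.pi * tau))
```

Note that the function needs `t_warmup` in both branches, which is the point made above: a scheduler with warmup generally has to know the warmup length, so specifying one total duration is more natural than summing two independent pieces.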
Created https://github.com/mosaicml/composer/issues/671.