# Time Abstraction
## Motivation
There are various measures of time during training, and we need a common, steppable abstraction to handle conversion between units. In the CV community, it is common to track time in terms of samples and batches, whereas in NLP it is more common to track time in terms of tokens and the overall duration of the training process. Here, we propose a time-tracking solution.
## Implementation
After discussion with @abhi-mosaic and @moinnadeem, we are leaning towards the following design:
1. `Time` objects will simplify time arithmetic. A `Time` object consists of an integer and a unit, which will be one of `epochs`, `batches`, `samples`, or `tokens`. Via overloaded operators, `Time` objects will support comparisons, addition, and subtraction against other `Time` objects of the same unit and, for backwards compatibility, against raw integers (though in that case a `UserWarning` will be emitted). They will also have getters to retrieve the underlying value as an integer and the unit. (See the `Time`/`Timer` sketch after this list.)
2. A `Timer` object, attached to the trainer's state, will track `epochs`, `batches`, `samples`, and `tokens`. These fields are `Time` objects, except for tokens, which may be `None` (for non-NLP training jobs). The timer object will have getters for each of these fields and a single update function that the training loop will call at the end of every batch, e.g. `timer.update(samples=X, tokens=Y)`.
3. To determine the number of samples and number of tokens in a batch, a dataset can provide `get_batch_size(batch)` and `get_num_tokens(batch)`. If not specified, the default `get_batch_size()` will be used, and tokens will NOT be tracked. (See the dataset-hooks sketch after this list.)
4. Datasets can optionally provide `__len__` and `get_num_tokens()`. By PyTorch convention, `__len__` should return the number of samples in the dataset. `get_num_tokens` can either return a constant number, perform some computation upon initialization to determine the number of tokens in the dataset, or (by default) return `None` if the number of tokens is unknown.
5. The `max_epochs` property in the trainer hparams will be replaced with `max_duration`, where the duration can be specified in terms of `epochs`, `steps`, `samples`, or `tokens`.
6. The trainer will have a function `trainer.get_elapsed_duration()` that will query the timer object and return a float in `[0, 1]` representing how much of the training process has been completed (relative to the `max_duration` parameter).
7. The timing module (NOT the timer object) will have a static method like:

   ```python
   def convert(time_string, desired_unit,
               dataset_num_samples: Optional[int] = None,
               dataset_num_tokens: Optional[int] = None,
               max_training_duration: Optional[str] = None,
               batch_size: Optional[int] = None):
       pass
   ```

   This method performs a static conversion between the specified time string and the desired unit. Depending on the conversion being performed, `dataset_num_samples`, `dataset_num_tokens`, `max_training_duration`, and/or `batch_size` will need to be provided. These parameters must be passed explicitly to emphasize that this is a static conversion, performed at the time of the call, which may become inaccurate if these parameters later change (e.g. an algorithm changes the training duration). The following conversions are allowed (see the conversion sketch after this list):
   1. epochs <-> batches, if `dataset_num_samples` and `batch_size` are defined
   2. epochs <-> samples, if `dataset_num_samples` is defined
   3. batches <-> samples, if `batch_size` is defined
   4. epochs <-> tokens, if `dataset_num_tokens` is defined
   5. duration <-> unit of `max_duration`: a duration string (e.g. "0.1dur") can be converted into the unit (e.g. `ep`) of `max_duration` (e.g. "90ep"); this example would return `9`
   6. duration <-> other units: if a unit other than that of `max_duration` is specified, the conversion will attempt to chain one or more of the above conversions.
- We will rewrite all schedulers to query the time object and perform a closed-form calculation to determine the learning rate, using `timer.get_elapsed_duration` and `timer.get_num_XXX` calls, so that they are compatible with datasets of unknown size or token count (see the scheduler sketches below). However, this can be done later; for the time being, `timer.convert` calls can be used to properly initialize schedulers upon creation.
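
To make items 1 and 2 concrete, below is a minimal sketch of what `Time` and `Timer` could look like. Only the class names and the four units come from the proposal; the `TimeUnit` enum, the method names (`update`, `on_epoch_end`), and all other implementation details are illustrative assumptions rather than a final API.

```python
import warnings
from enum import Enum
from typing import Optional, Union


class TimeUnit(Enum):
    # Hypothetical unit enum; the proposal only names the four units.
    EPOCHS = "ep"
    BATCHES = "ba"
    SAMPLES = "sp"
    TOKENS = "tok"


class Time:
    """An integer value tagged with a unit (item 1). Sketch only."""

    def __init__(self, value: int, unit: TimeUnit):
        self.value = value
        self.unit = unit

    def _coerce(self, other: Union["Time", int]) -> "Time":
        # Raw ints are accepted for backwards compatibility, but emit a UserWarning.
        if isinstance(other, int):
            warnings.warn(f"Mixing a raw int with a Time object; assuming unit {self.unit}",
                          UserWarning)
            return Time(other, self.unit)
        if other.unit != self.unit:
            raise ValueError(f"Unit mismatch: {self.unit} vs {other.unit}")
        return other

    def __add__(self, other):
        other = self._coerce(other)
        return Time(self.value + other.value, self.unit)

    def __sub__(self, other):
        other = self._coerce(other)
        return Time(self.value - other.value, self.unit)

    def __lt__(self, other):
        return self.value < self._coerce(other).value

    def __eq__(self, other):
        return isinstance(other, (Time, int)) and self.value == self._coerce(other).value


class Timer:
    """Tracks elapsed epochs, batches, samples, and tokens (item 2). Sketch only."""

    def __init__(self, track_tokens: bool = False):
        self.epochs = Time(0, TimeUnit.EPOCHS)
        self.batches = Time(0, TimeUnit.BATCHES)
        self.samples = Time(0, TimeUnit.SAMPLES)
        # Tokens may be None for non-NLP training jobs.
        self.tokens: Optional[Time] = Time(0, TimeUnit.TOKENS) if track_tokens else None

    def update(self, samples: int, tokens: Optional[int] = None) -> None:
        # Called by the training loop at the end of every batch.
        self.batches += Time(1, TimeUnit.BATCHES)
        self.samples += Time(samples, TimeUnit.SAMPLES)
        if self.tokens is not None and tokens is not None:
            self.tokens += Time(tokens, TimeUnit.TOKENS)

    def on_epoch_end(self) -> None:
        self.epochs += Time(1, TimeUnit.EPOCHS)
```

With something like this, the loop body reduces to a single call such as `state.timer.update(samples=dataset.get_batch_size(batch), tokens=dataset.get_num_tokens(batch))`.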
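
Items 3 and 4 amount to two optional hooks on the dataset. A rough sketch follows; the method names match the proposal, but the batch format (a dict of tensors) and the merged `get_num_tokens` signature covering both the per-batch and whole-dataset cases are assumptions.

```python
from typing import Any, Optional


class StreamingTextDataset:
    """Hypothetical dataset illustrating the optional hooks from items 3 and 4."""

    def __len__(self) -> int:
        # By PyTorch convention, the number of samples in the dataset.
        return 1_000_000

    def get_batch_size(self, batch: Any) -> int:
        # Number of samples in this batch; the batch is assumed to be a dict of tensors.
        return batch["input_ids"].shape[0]

    def get_num_tokens(self, batch: Optional[Any] = None) -> Optional[int]:
        # With a batch: the number of tokens in that batch (item 3).
        # Without a batch: the number of tokens in the whole dataset, or None if unknown (item 4).
        if batch is not None:
            return int(batch["attention_mask"].sum())
        return None  # unknown for a streaming dataset
```

A default `get_batch_size` could fall back to the leading dimension of the first tensor in the batch, matching the "if not specified" behaviour described above.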
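
The conversions in item 7 are simple arithmetic once the required parameters are supplied. Below is a sketch of a subset of them; the unit abbreviations other than `ep` and `dur`, the `_parse` helper, and the rounding behaviour are assumptions.

```python
import math
from typing import Optional, Tuple


def _parse(time_string: str) -> Tuple[float, str]:
    # Hypothetical parser: split the numeric prefix from the alphabetic unit suffix.
    idx = next(i for i, c in enumerate(time_string) if c.isalpha())
    return float(time_string[:idx]), time_string[idx:]


def convert(time_string: str, desired_unit: str,
            dataset_num_samples: Optional[int] = None,
            dataset_num_tokens: Optional[int] = None,
            max_training_duration: Optional[str] = None,
            batch_size: Optional[int] = None) -> int:
    """Static conversion between time units (a subset of item 7's cases)."""
    value, unit = _parse(time_string)

    if unit == "dur":
        # A duration string is a fraction of max_training_duration,
        # e.g. 0.1 of "90ep" -> 9 epochs.
        assert max_training_duration is not None
        max_value, unit = _parse(max_training_duration)
        value *= max_value

    if unit == desired_unit:
        return int(round(value))
    if unit == "ep" and desired_unit == "ba":
        assert dataset_num_samples is not None and batch_size is not None
        return int(round(value * math.ceil(dataset_num_samples / batch_size)))
    if unit == "ep" and desired_unit == "sp":
        assert dataset_num_samples is not None
        return int(round(value * dataset_num_samples))
    if unit == "ep" and desired_unit == "tok":
        assert dataset_num_tokens is not None
        return int(round(value * dataset_num_tokens))
    raise NotImplementedError(f"conversion {unit} -> {desired_unit} not sketched here")


# e.g. convert("0.1dur", "ep", max_training_duration="90ep") == 9, matching the example above.
```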
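
As an illustration of the closed-form scheduler idea, here is a sketch of a warmup factor computed directly from the timer's counters; the function name and the exact accessors (`timer.get_num_tokens()`, `timer.get_elapsed_duration()`) are assumed shapes of the getters described above.

```python
def warmup_lr_factor(timer, warmup_tokens: int = 1_000_000) -> float:
    """Closed-form linear warmup driven by the tokens seen so far.

    The scheduler keeps no internal step counter, so it cannot fall out of sync, and it
    works even when the dataset's total size or token count is unknown.
    """
    tokens_seen = timer.get_num_tokens()  # running counter from the Timer (assumed getter)
    if tokens_seen is None:
        # Non-NLP job: fall back to warming up over the first 5% of the training duration.
        return min(1.0, timer.get_elapsed_duration() / 0.05)
    return min(1.0, tokens_seen / warmup_tokens)
```

Each batch, the training loop would multiply the base learning rate by this factor.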
## TODO
- PR 1: Build out the timer, and use the timer to track progress in the training loop. Update the state object. Should be a non-breaking change.
- PR 2: Update the rest of the codebase to support timing strings (e.g. in schedulers, checkpoint intervals, flush intervals, etc.). If needed, use `timer.convert` to stay compatible with existing PyTorch components.
- PR 3: Create our own drop-in replacements for the PyTorch schedulers that do not depend on `timer.convert`.
- PR 4 (can be concurrent, or maybe should be done with PR 3): Update the algorithms. Try to avoid using the timer in the functional form.
## Comments (8 by maintainers)
I don't have a ton to add atm.
The only thing that comes to mind is that it looks like the scale schedule algorithm modifies `max_epochs` (to become `max_duration`) in a `state` object, which has two implications to me:

1. it is `state`'s (and not `hparam`'s) `max_duration` that should be treated as the source of truth;
2. are `time` or `timer` objects used prior to `scale_schedule` being called, and if so, will it cause problems that `scale_schedule` has changed `max_duration` between calls/uses of `time`/`timer`?

---

Big +1 here.
Re. `batch_size`, I think we are planning ahead for variable-batch-size algorithms, which are already used in NLP for warmups. So it would be safer to query the current batch size at each step rather than hard-code it at the start.

For schedulers, I think the main concern is that using `scheduler.step` assumes things about how Time passes and places time state within the Scheduler (which can fall out of sync), whereas the cleaner way would be to treat the scheduler as a stateless function that returns the decay factor given the current Time, something like `scheduler.get_factor(timer)`.

My hope is that our reimplementations of the common schedulers will actually be cleaner than PyTorch's, almost like one-line functions. And making a custom Scheduler should also be pretty easy. Something like:
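
A minimal sketch of what such a stateless scheduler interface could look like, assuming a `get_factor(timer)` method; the `MultiStepScheduler`/`CosineScheduler` classes and the duck-typed fake timer are purely illustrative.

```python
import math
from types import SimpleNamespace
from typing import List


class Scheduler:
    """Stateless scheduler: the LR factor is a pure function of the current time."""

    def get_factor(self, timer) -> float:
        raise NotImplementedError


class MultiStepScheduler(Scheduler):
    """Step decay at fixed epoch milestones, with no internal counters to fall out of sync."""

    def __init__(self, milestones: List[int], gamma: float = 0.1):
        self.milestones = milestones  # e.g. [30, 60, 80] (epochs)
        self.gamma = gamma

    def get_factor(self, timer) -> float:
        passed = sum(1 for m in self.milestones if timer.epochs >= m)
        return self.gamma ** passed


class CosineScheduler(Scheduler):
    def get_factor(self, timer) -> float:
        # Almost a one-liner, driven entirely by the elapsed fraction of training.
        return 0.5 * (1.0 + math.cos(math.pi * timer.get_elapsed_duration()))


# Duck-typed stand-in for the real timer, just to show the call pattern:
fake_timer = SimpleNamespace(epochs=45, get_elapsed_duration=lambda: 0.5)
print(MultiStepScheduler([30, 60, 80]).get_factor(fake_timer))  # 0.1 (one milestone passed)
print(CosineScheduler().get_factor(fake_timer))                 # 0.5
```
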
And YAMLs could then express schedulers and their milestones directly as time strings, whereas our current Trainer is only capable of handling the existing epoch-based configuration.
What do you think, @A-Jacobson?