[Tune] Logging with multiple time intervals
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): Pip
- Ray version: 0.6.3
- Python version: 3.6.5
Describe the problem
The Trainable interface in Tune expects the step method to output a logging dictionary. However, it is unclear how to annotate logging statements with different global steps. For example, one may want to record the model’s gradients at every training iteration, but only record the dev metric once per epoch.
One solution that we are exploring is building an adapter Trainable interface with a custom logger (the default logger being deactivated) that would be passed to the step method. The logger would then expose a method such as `log(key, value, time_step)`. The custom logger, when given a result (i.e. in `on_result`), would then parse the list of `(key, value, time_step)` tuples and output the correct TensorBoard graphs. The only downside of this method is that the user would have to wait until the step is finished before seeing the logs for that step.
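A minimal sketch of the adapter idea described above, assuming a hypothetical `StepLogger` class (the names `StepLogger`, `flush`, and `custom_records` are illustrative, not part of Tune's API):

```python
class StepLogger:
    """Collects (key, value, time_step) tuples during a single step().

    A Tune logger hook (e.g. on_result) would later replay each tuple
    to TensorBoard at its own time_step instead of using one shared step.
    """

    def __init__(self):
        self.records = []

    def log(self, key, value, time_step):
        self.records.append((key, value, time_step))

    def flush(self):
        """Return the collected tuples and reset for the next step."""
        records, self.records = self.records, []
        return records


# Inside a hypothetical step(): gradients every inner iteration,
# the dev metric only once at the end of the epoch.
logger = StepLogger()
for i in range(3):                                # three inner iterations
    logger.log("grad_norm", 0.1 * i, time_step=i)
logger.log("dev_accuracy", 0.9, time_step=2)      # once per epoch
result = {"custom_records": logger.flush()}       # returned from step()
```

The result dictionary then carries per-value time steps through Tune's normal reporting path, at the cost of the logs only appearing once the step finishes.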
I was wondering if you had faced this question before, and had some thoughts about the best way to approach it. Thank you in advance for your help!
Issue Analytics
- State:
- Created 5 years ago
- Comments: 11 (8 by maintainers)
Top GitHub Comments
@jeremyasapp - would something like this work?
Here, you’d get `loss` having the right training step, and the extras would be reported at every multiple of `n_iter_per_step` - I think this should work, but let me know if otherwise.

Hi, thanks for your quick response!
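The snippet the maintainer originally posted is not preserved in this archive. A minimal sketch of the pattern being suggested, under the assumption that each `step()` call runs `n_iter_per_step` inner iterations and only attaches the extra metrics periodically (the function signature and the `dev_accuracy` key are illustrative, not Tune's API):

```python
def step(global_step, n_iter_per_step=5):
    """Run n_iter_per_step inner iterations and build one result dict."""
    loss = None
    for i in range(n_iter_per_step):
        training_step = n_iter_per_step * global_step + i
        loss = 1.0 / (training_step + 1)        # dummy loss computation
    result = {
        "loss": loss,                            # reported every step
        "timesteps_total": training_step + 1,    # true training step count
    }
    if (global_step + 1) % 2 == 0:               # extras only sometimes
        result["dev_accuracy"] = 0.9             # dummy dev metric
    return result


r0 = step(0)   # loss only
r1 = step(1)   # loss plus the periodic extras
```

Reporting `timesteps_total` is what lets `loss` line up with the right training step, while the extras simply appear in fewer result dictionaries.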
So I think I understand what you mean, but these lines in the TFLogger gave me the impression that all the metrics are given the same step:
What I’m confused about is that we only get to return a single dictionary of values every step, but within a step, it’s possible that you may generate multiple values for the same key. Take for example:

In this example, it’s unclear to me how TensorBoard will go about parsing the list of values. Ideally I would call tf_summary with the correct training time step for each value in `losses` (i.e. `n_iter_per_step * global_step + i`). Please let me know if that makes things more clear, or I can give a more concrete example. Thank you!
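A small sketch of the behavior being asked for: given the list of `losses` produced inside one step, compute the individual training step `n_iter_per_step * global_step + i` for each element, so a logger could emit one TensorBoard scalar per value rather than one per result dictionary (the helper name `expand_losses` is hypothetical):

```python
def expand_losses(losses, global_step, n_iter_per_step):
    """Pair each loss from one step() call with its own training step."""
    return [
        (n_iter_per_step * global_step + i, loss)
        for i, loss in enumerate(losses)
    ]


# Three inner-loop losses produced during global_step 2:
pairs = expand_losses([0.5, 0.4, 0.3], global_step=2, n_iter_per_step=3)
# A TF logger could then write one summary per (step, value) pair
# instead of attaching every value to the same global step.
```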