[Tune] Logging with multiple time intervals
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): Pip
- Ray version: 0.6.3
- Python version: 3.6.5
Describe the problem
The Trainable interface in Tune expects the step method to output a logging dictionary. However, it is unclear how to annotate logging statements with different global steps. For example, one may want to record the model’s gradients at every training iteration, but only record the dev metric once per epoch.
One solution that we are exploring is building an adapter Trainable interface with a custom logger (the default logger being deactivated) that would be passed to the step method. The logger would then expose a method such as `log(key, value, time_step)`. The custom logger, when given a result (i.e. in `on_result`), would then parse the list of `(key, value, time_step)` tuples and output the correct TensorBoard graphs. The only downside of this method is that the user would have to wait until the step is finished before seeing the logs for that step.
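A minimal sketch of the adapter idea described above, assuming a hypothetical `StepLogger` class (the names `StepLogger`, `flush`, and `custom_records` are illustrative, not part of Tune's API):

```python
class StepLogger:
    """Collects (key, value, time_step) tuples during a single step().

    A Tune logger hook (e.g. on_result) would later replay each tuple
    to TensorBoard at its own time_step instead of using one shared step.
    """

    def __init__(self):
        self.records = []

    def log(self, key, value, time_step):
        self.records.append((key, value, time_step))

    def flush(self):
        """Return the collected tuples and reset for the next step."""
        records, self.records = self.records, []
        return records


# Inside a hypothetical step(): gradients every inner iteration,
# the dev metric only once at the end of the epoch.
logger = StepLogger()
for i in range(3):                                # three inner iterations
    logger.log("grad_norm", 0.1 * i, time_step=i)
logger.log("dev_accuracy", 0.9, time_step=2)      # once per epoch
result = {"custom_records": logger.flush()}       # returned from step()
```

The result dictionary then carries per-value time steps through Tune's normal reporting path, at the cost of the logs only appearing once the step finishes.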
I was wondering if you had faced this question before, and had some thoughts about the best way to approach it. Thank you in advance for your help!
Issue Analytics
- State:
- Created 5 years ago
- Comments: 11 (8 by maintainers)
Top GitHub Comments
@jeremyasapp - would something like this work?
Here, you’d get `loss` having the right training step, and the extras would be reported at every multiple of `n_iter_per_step` - I think this should work, but let me know if otherwise.

Hi, thanks for your quick response!
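The snippet the maintainer originally posted is not preserved in this archive. A minimal sketch of the pattern being suggested, under the assumption that each `step()` call runs `n_iter_per_step` inner iterations and only attaches the extra metrics periodically (the function signature and the `dev_accuracy` key are illustrative, not Tune's API):

```python
def step(global_step, n_iter_per_step=5):
    """Run n_iter_per_step inner iterations and build one result dict."""
    loss = None
    for i in range(n_iter_per_step):
        training_step = n_iter_per_step * global_step + i
        loss = 1.0 / (training_step + 1)        # dummy loss computation
    result = {
        "loss": loss,                            # reported every step
        "timesteps_total": training_step + 1,    # true training step count
    }
    if (global_step + 1) % 2 == 0:               # extras only sometimes
        result["dev_accuracy"] = 0.9             # dummy dev metric
    return result


r0 = step(0)   # loss only
r1 = step(1)   # loss plus the periodic extras
```

Reporting `timesteps_total` is what lets `loss` line up with the right training step, while the extras simply appear in fewer result dictionaries.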
So I think I understand what you mean, but these lines in the TFLogger gave me the impression that all the metrics are given the same step:
What I’m confused about is that we only get to return a single dictionary of values every step, but within a step, it’s possible that you may generate multiple values for the same key. Take for example:

In this example, it’s unclear to me how TensorBoard will go about parsing the list of values. Ideally I would call tf_summary with the correct training time step for each value in `losses` (i.e. `n_iter_per_step * global_step + i`). Please let me know if that makes things more clear, or I can give a more concrete example. Thank you!
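A small sketch of the behavior being asked for: given the list of `losses` produced inside one step, compute the individual training step `n_iter_per_step * global_step + i` for each element, so a logger could emit one TensorBoard scalar per value rather than one per result dictionary (the helper name `expand_losses` is hypothetical):

```python
def expand_losses(losses, global_step, n_iter_per_step):
    """Pair each loss from one step() call with its own training step."""
    return [
        (n_iter_per_step * global_step + i, loss)
        for i, loss in enumerate(losses)
    ]


# Three inner-loop losses produced during global_step 2:
pairs = expand_losses([0.5, 0.4, 0.3], global_step=2, n_iter_per_step=3)
# A TF logger could then write one summary per (step, value) pair
# instead of attaching every value to the same global step.
```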