Conflict between TensorBoard and TensorFlow Training (accessing Checkpoints)
When I run the TensorFlow Object Detection API, start a training run, interrupt it, and resume it later while TensorBoard is running, training fails because it tries to rename some checkpoint files that are apparently locked by TensorBoard:
2018-01-19 15:54:45.633575: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\
core\framework\op_kernel.cc:1192] Unknown: Failed to rename: C:/Users/Alex/Repositories/MusicObjec
tDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_p
retrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index.tempstate676747125244
4121708 to: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints
-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-t
rain\model.ckpt-92013.index : Access is denied.
; Input/output error
INFO:tensorflow:Error reported to Coordinator: <class tensorflow.python.framework.errors_impl.Unkn
ownError'>, Failed to rename: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetecto
r/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimen
sion_clustering2-train\model.ckpt-92013.index.tempstate6767471252444121708 to: C:/Users/Alex/Repos
itories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v
2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index :
Access is denied.
I was wondering whether it would be possible to make sure that TensorBoard does not lock out other processes. Or is it entirely impossible to read a file without locking it? I don't know what TensorBoard actually reads from the *.index file that would take longer than a split second before the file is released again. I understand that loading the events from the events.out.tfevents.*.* files takes a while to process, but that apparently works.
Top GitHub Comments
I have found that I get this error if I have an Explorer window also watching the folder. I closed the Explorer window and the error stopped appearing. That leads me to think that it is Explorer that is locking the file, and TensorBoard gets locked out.
The same thing is happening to me as well. If you close TensorBoard and the Explorer process, it works without a hitch.
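Both comments suggest the lock is short-lived and is held by whichever process happens to be looking at the folder. As a rough illustration only, here is how a rename can be made tolerant of such a transient lock in a user's own Python script. The helper `replace_with_retry` and its parameters are hypothetical and are not part of TensorFlow or TensorBoard; the failing rename in the log above happens inside TensorFlow's own file system code, so this sketch cannot be dropped in there.

```python
import os
import time


def replace_with_retry(src, dst, attempts=5, delay=0.2, replace=os.replace):
    """Rename src to dst, retrying if another process briefly holds the file.

    Hypothetical helper for illustration; not a TensorFlow or TensorBoard API.
    The `replace` argument exists only so the rename call can be swapped out.
    """
    for attempt in range(1, attempts + 1):
        try:
            replace(src, dst)
            return attempt  # number of tries that were needed
        except PermissionError:
            # On Windows, "Access is denied" is raised as PermissionError.
            if attempt == attempts:
                raise  # the lock never went away; give up
            time.sleep(delay * attempt)  # wait a little longer each time
```

If the lock is held for the whole lifetime of the other process rather than for a moment, retrying will not help, and closing TensorBoard or the Explorer window as described above remains the workaround.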