Conflict between Tensorboard and Tensorflow Training (accessing Checkpoints)


When I run the TensorFlow Object Detection API, start a training run, interrupt it, and resume it later while TensorBoard is running, training fails because it tries to rename some checkpoint files that are apparently locked by TensorBoard:

2018-01-19 15:54:45.633575: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\framework\op_kernel.cc:1192] Unknown: Failed to rename: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index.tempstate6767471252444121708 to: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index : Access is denied.
; Input/output error
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, Failed to rename: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index.tempstate6767471252444121708 to: C:/Users/Alex/Repositories/MusicObjectDetector-TF/MusicObjectDetector/data/checkpoints-faster_rcnn_inception_resnet_v2_atrous_muscima_pretrained_with_stafflines_dimension_clustering2-train\model.ckpt-92013.index : Access is denied.

Would it be possible to make sure that TensorBoard does not lock out other processes? Or is it impossible to read a file without locking it? I don’t know what TensorBoard actually reads from the *.index file that would take longer than a split second before the file is released again. I understand that loading the events from events.out.tfevents.*.* takes a while to process, but there it apparently works.
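
The question above comes down to whether a process that merely holds a file open for reading can block a rename onto that path. Here is a minimal Python sketch that probes this on the current platform. It is not taken from TensorFlow or TensorBoard; the helper name `rename_succeeds_while_open` and the file names are made up for illustration. On Windows, Python's `open()` does not request delete sharing, so the replace is expected to raise `PermissionError`; on POSIX systems it normally succeeds.

```python
import os
import tempfile


def rename_succeeds_while_open(directory):
    """Return True if a file can be replaced while a reader holds it open."""
    # Stand-in names; the real checkpoint files are written by TensorFlow.
    target = os.path.join(directory, "model.ckpt.index")
    temp = os.path.join(directory, "model.ckpt.index.tmp")
    with open(target, "wb") as f:
        f.write(b"old")
    with open(temp, "wb") as f:
        f.write(b"new")

    # The open handle plays the role of the process watching the folder.
    with open(target, "rb"):
        try:
            # The replace plays the role of the checkpoint writer.
            os.replace(temp, target)
            return True
        except PermissionError:
            return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("rename allowed while file is open:", rename_succeeds_while_open(d))
```

Whether a real reader blocks the rename depends on the sharing flags it uses when opening the file, which this sketch does not inspect.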

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

scign commented, Dec 17, 2018 (3 reactions)

I have found that I get this error if I have an Explorer window also watching the folder. I closed the Explorer window and the error stopped appearing. That leads me to think that it’s Explorer that is locking the file and TensorBoard gets locked out.

nikzasel commented, Nov 18, 2019 (0 reactions)

The same thing is happening to me as well. If you shut down TensorBoard and the Explorer process, it works without problems.
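
If the lock is only held briefly, one generic idea is to retry the rename instead of failing on the first denial. The following is a minimal sketch under that assumption. `replace_with_retry` and its parameters are hypothetical names, and the rename in the log above happens inside TensorFlow itself, so this only illustrates the idea and is not a drop-in fix.

```python
import os
import time


def replace_with_retry(src, dst, attempts=5, delay=0.2, replace=os.replace):
    """Move src over dst, retrying when access is denied.

    Returns the number of the attempt that succeeded. The `replace`
    parameter exists only so the behavior can be tested without a real lock.
    """
    for attempt in range(1, attempts + 1):
        try:
            replace(src, dst)
            return attempt
        except PermissionError:
            # Give up after the last attempt and surface the original error.
            if attempt == attempts:
                raise
            time.sleep(delay)
```

A retry only helps when the other process releases the file between attempts; if the folder is watched continuously, closing the watcher as described above is the more reliable route.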


