Eval gets stuck forever in the Trainer Component
When I’m in the Trainer component, eval gets stuck forever:
[2019-05-13 17:40:53,499] {logging_mixin.py:95} INFO - [2019-05-13 17:40:53,499] {saver.py:1270} INFO - Restoring parameters from /home/benjamintan/workspace/darkrai/logs/shapes_768_1024_20190513T1740/model.ckpt-100
[2019-05-13 17:40:54,151] {logging_mixin.py:95} INFO - [2019-05-13 17:40:54,151] {session_manager.py:491} INFO - Running local_init_op.
[2019-05-13 17:40:54,198] {logging_mixin.py:95} INFO - [2019-05-13 17:40:54,198] {session_manager.py:493} INFO - Done running local_init_op.
Strangely, if I replace the eval example path with the training example path, it manages to progress to the model validator (though it then fails model validation).
Any pointers on how to debug this?
Issue Analytics
- State:
- Created 4 years ago
- Comments:9 (7 by maintainers)
@benjamintanweihao Thanks for reporting this! This actually helped surface a bug in our trainer executor. The train_files and eval_files should have a consistent type. Will send a PR to fix this soon.

I think I found the problem! I parsed the filenames parameter from eval_input_fn(filenames, transform_output) wrongly. It turns out that filenames is a string and not a list?
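For anyone hitting the same string-versus-list mismatch, here is a minimal sketch of a defensive normalization step. The helper name as_file_list is hypothetical and is not part of TFX or the trainer executor; it only illustrates why treating a single string as a list of files goes wrong, and one way to guard against it inside an input function.

```python
from typing import List, Sequence, Union


def as_file_list(filenames: Union[str, Sequence[str]]) -> List[str]:
    """Hypothetical helper: always return a list of path strings.

    A lone string is wrapped in a list rather than iterated, because
    iterating a string yields its individual characters.
    """
    if isinstance(filenames, str):
        return [filenames]
    return list(filenames)


# Iterating a bare string gives characters, not file paths.
pattern = "data/eval/*"  # placeholder path, for illustration only
print(list(pattern)[:3])         # ['d', 'a', 't']

# After normalization both input shapes behave the same way.
print(as_file_list(pattern))     # ['data/eval/*']
print(as_file_list(["a", "b"]))  # ['a', 'b']
```

Calling something like this at the top of both the train and eval input functions would at least make the two code paths accept the same types, whichever form the executor passes in.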