Eval gets stuck forever in the Trainer Component

When I’m in the Trainer component, eval gets stuck forever:

[2019-05-13 17:40:53,499] {logging_mixin.py:95} INFO - [2019-05-13 17:40:53,499] {saver.py:1270} INFO - Restoring parameters from /home/benjamintan/workspace/darkrai/logs/shapes_768_1024_20190513T1740/model.ckpt-100
[2019-05-13 17:40:54,151] {logging_mixin.py:95} INFO - [2019-05-13 17:40:54,151] {session_manager.py:491} INFO - Running local_init_op.
[2019-05-13 17:40:54,198] {logging_mixin.py:95} INFO - [2019-05-13 17:40:54,198] {session_manager.py:493} INFO - Done running local_init_op.

Strangely, if I replace the eval example path with the training example path, it manages to make progress to the model validator (though it then fails model validation).

Any pointers on how to debug this?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
ruoyu90 commented, May 16, 2019

@benjamintanweihao Thanks for reporting this! This actually helped surface a bug in our trainer executor. The train_files and eval_files should have a consistent type. Will send a PR to fix this soon.

0 reactions
benjamintanweihao commented, May 16, 2019

I think I found the problem! I parsed the filenames parameter from eval_input_fn(filenames, transform_output) wrongly. It turns out that filenames is a string and not a list?
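
For anyone hitting the same thing, here is a minimal sketch of one way to guard against the string-versus-list mismatch described above. It is not the fix that went into the trainer executor. The helper name `normalize_filenames` and the example path are hypothetical, and the sketch uses plain Python only, so it makes no assumptions about the TFX or TensorFlow APIs.

```python
from typing import List, Sequence, Union


def normalize_filenames(filenames: Union[str, Sequence[str]]) -> List[str]:
    """Return filenames as a list, whether one pattern or several were passed.

    Hypothetical helper for illustration; not part of TFX.
    """
    if isinstance(filenames, str):
        # A bare string is a single path or glob pattern, not a sequence of paths.
        return [filenames]
    return list(filenames)


# Hypothetical path, used only for illustration.
eval_pattern = "data/eval/*.tfrecord"

# Pitfall: iterating over a string yields single characters, not paths.
first_chars = list(eval_pattern)[:4]

# After normalizing, both call styles produce a list of paths.
from_string = normalize_filenames(eval_pattern)
from_list = normalize_filenames(["a.tfrecord", "b.tfrecord"])
```

Calling a helper like this at the top of both the train and eval input functions would give them the same input type regardless of what the caller passes, which is the consistency the maintainer comment asks for.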
