
Validation for non-distributed training

See original GitHub issue

Hi!

I was going through the code and found that the validate argument in the _non_dist_train function in mmdet/apis/train.py is not being used. I guess that needs to be incorporated in the function. Please let me know if that’s the case.

Thanks!
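
For context, the function in question looks roughly like this in the mmdet 1.x-era code (paraphrased, not the verbatim source): the validate flag is accepted but never consulted in the body, so no evaluation hook gets registered.

    # mmdet/apis/train.py (paraphrased sketch of the non-distributed path)
    def _non_dist_train(model, dataset, cfg, validate=False):
        # builds the train dataloader, wraps the model in MMDataParallel,
        # creates the runner and registers optimizer/checkpoint/logging hooks ...
        # `validate` is never checked, so no evaluation hook is registered.
        ...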

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
ferranrigual commented, Aug 12, 2019

Just in case it helps anyone.

In mmdet/core/evaluation/eval_hooks.py, not only did I comment out the dist.barrier() lines, I also replaced

    result = runner.model(return_loss=False, rescale=True, **data_gpu)

with

    result = runner.model.module(return_loss=False, rescale=True, **data_gpu)

And then I was able to train with 4 GPUs and validate. Best,
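
For reference, a rough sketch of how the evaluation loop in eval_hooks.py ends up looking after those two changes (the surrounding lines are paraphrased from the mmdet 1.x-era DistEvalHook and may differ slightly in your checkout):

    # mmdet/core/evaluation/eval_hooks.py, inside after_train_epoch (paraphrased)
    runner.model.eval()
    results = [None for _ in range(len(self.dataset))]
    for idx in range(runner.rank, len(self.dataset), runner.world_size):
        data_gpu = ...  # a single collated sample moved to the current GPU
        with torch.no_grad():
            # .module unwraps the DataParallel wrapper, so the detector itself
            # receives the keyword arguments it expects
            result = runner.model.module(return_loss=False, rescale=True, **data_gpu)
        results[idx] = result
    # dist.barrier()  # commented out: no process group is initialised when
    #                 # training without the distributed launcher
    self.evaluate(runner, results)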

3 reactions
dhananjaisharma10 commented, Aug 2, 2019

You are right. Evaluating mAP after training epochs is currently not supported for non-distributed training.

Hey! I was wondering if I could contribute to the code by making a few tweaks so that training with validation works for both distributed and non-distributed runs, along the lines of what tools/test.py does. I'm not sure whether that is the best approach, so please let me know if I should go ahead and submit a pull request. Thanks!
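
As an illustration of the kind of tweak being proposed, here is a minimal sketch of a validation hook for the non-distributed path. The hook class, its name, and the wiring are hypothetical (they are not part of mmdet); only mmcv's Hook base class, the scatter helper, and the return_loss=False, rescale=True call convention from the thread above are assumed.

    import torch
    from mmcv.parallel import scatter
    from mmcv.runner import Hook

    class NonDistEvalHook(Hook):  # hypothetical name, not an mmdet class
        """Run validation every `interval` epochs without torch.distributed."""

        def __init__(self, dataloader, interval=1):
            self.dataloader = dataloader
            self.interval = interval

        def after_train_epoch(self, runner):
            if not self.every_n_epochs(runner, self.interval):
                return
            runner.model.eval()
            results = []
            for data in self.dataloader:
                # move the collated batch to the current GPU
                data = scatter(data, [torch.cuda.current_device()])[0]
                with torch.no_grad():
                    # .module unwraps the DataParallel wrapper used in
                    # non-distributed training
                    results.append(runner.model.module(
                        return_loss=False, rescale=True, **data))
            # compute mAP (or any other metric) from `results` against the
            # validation annotations and log it, e.g. via runner.log_buffer

In _non_dist_train, the currently unused validate flag could then simply gate the registration of such a hook, e.g. if validate: runner.register_hook(NonDistEvalHook(val_loader)).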

Read more comments on GitHub >

Top Results From Across the Web

Train a model — MMSegmentation 0.29.1 documentation
MMSegmentation implements distributed training and non-distributed training, ... By default we evaluate the model on the validation set after some ...
Read more >
Seemingly good results with training a CNN but bad when ...
It means that your model is very good on training data but ... Another issue could be that your train / validation split...
Read more >
Distributed Input | TensorFlow Core
In a non-distributed training loop, first create a tf.data. ... You should only enable them after you validate that they benefit the performance...
Read more >
How to run distributed training using Horovod and MXNet ...
You can convert the non-distributed training script to a Horovod world by ... [1,0]<stderr>:INFO:root:Training finished with Validation ...
Read more >
Distributed Training of Knowledge Graph Embedding ...
model, we divided the data into train, test and validation sets. We ... distributed training using Ray against non-distributed training for ...
Read more >
