Validation for non-distributed training
Hi!
I was going through the code and found that the validate argument of the _non_dist_train function in mmdet/apis/train.py is not being used. I guess it needs to be incorporated into the function. Please let me know if that's the case.
Thanks!
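A rough sketch of what honoring the flag might look like: when validate is True, _non_dist_train could register an evaluation hook on the runner, mirroring what the distributed code path already does with its eval hooks. The hook name SingleGPUEvalHook below is hypothetical (it is not part of mmdet), and the commented build_dataloader call is only illustrative; the actual wiring depends on the mmdet version.

```python
import torch
from mmcv.runner import Hook


class SingleGPUEvalHook(Hook):  # hypothetical name, not part of mmdet
    """Evaluate on the validation set every `interval` epochs (sketch)."""

    def __init__(self, dataloader, interval=1):
        self.dataloader = dataloader
        self.interval = interval

    def after_train_epoch(self, runner):
        if not self.every_n_epochs(runner, self.interval):
            return
        runner.model.eval()
        results = []
        with torch.no_grad():
            for data in self.dataloader:
                # single-GPU forward pass in test mode
                result = runner.model(return_loss=False, rescale=True, **data)
                results.append(result)
        # ... compute metrics from `results` against the val dataset and log them ...


# Inside _non_dist_train (illustrative only):
#
# if validate:
#     val_loader = build_dataloader(val_dataset, imgs_per_gpu=1,
#                                   workers_per_gpu=cfg.data.workers_per_gpu,
#                                   dist=False, shuffle=False)
#     runner.register_hook(SingleGPUEvalHook(val_loader, interval=1))
```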
Issue Analytics
- State:
- Created 4 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Just in case it helps anyone.
In mmdet/core/evaluation/eval_hooks.py, not only did I comment out the
dist.barrier()
lines, I also replaced
result = runner.model(return_loss=False, rescale=True, **data_gpu)
with
result = runner.model.module(return_loss=False, rescale=True, **data_gpu)
And then I was able to train with 4 GPUs and validate. Best,
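For context, the edit described above roughly amounts to the following inside the eval hook's after_train_epoch method (a sketch modeled on the mmdet 1.x eval_hooks.py; the exact surrounding code varies by version):

```python
import torch
from mmcv.parallel import collate, scatter


def after_train_epoch(self, runner):
    runner.model.eval()
    results = [None for _ in range(len(self.dataset))]
    for idx in range(runner.rank, len(self.dataset), runner.world_size):
        data = self.dataset[idx]
        data_gpu = scatter(
            collate([data], samples_per_gpu=1),
            [torch.cuda.current_device()])[0]
        with torch.no_grad():
            # original line, which goes through the parallel wrapper:
            # result = runner.model(return_loss=False, rescale=True, **data_gpu)
            # replacement from the comment above, calling the wrapped detector:
            result = runner.model.module(
                return_loss=False, rescale=True, **data_gpu)
        results[idx] = result
    # dist.barrier()  # commented out, as described, because it hangs when the
    #                 # processes were not launched with torch.distributed
    self.evaluate(runner, results)
```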
Hey! I was wondering if I could contribute by making a few tweaks so that the code works for distributed/non-distributed training with validation, along with something similar to the
tools/test.py
file. Though I'm not sure whether that is the best way, please let me know if I should go ahead and submit a pull request. Thanks!