
Evaluation results vary for the same saved weights

See original GitHub issue

❓ Questions and Help

I have a question about evaluating my model. I ran the command below several times and it returned different mAP results each time:

    python -m torch.distributed.launch --nproc_per_node=1 tools/test_net.py --config-file "stand_file/e2e_faster_rcnn_R_50_FPN_1x.yaml" TEST.IMS_PER_BATCH 16

I would like to know why this happened.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
fmassa commented, Jul 1, 2019

This probably happens because when you batch different images together, they get different paddings, which slightly affects the output of the model (i.e., the predictions).

In your case you are using a batch size of 16, and by default the images are shuffled during evaluation (https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/maskrcnn_benchmark/data/build.py#L126), so every run sees different batches of images, and thus different paddings and different results.

Try removing the shuffling, or set the batch size to 1 (which is the most robust solution anyway).
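For reference, the shuffle default lives at the line linked above; a paraphrased sketch (not verbatim repository code) of that branch of make_data_loader, together with the one-line workaround, looks like this:

    # Paraphrased sketch of the evaluation branch in make_data_loader:
    shuffle = False if not is_distributed else True
    # Workaround: force deterministic evaluation batches
    shuffle = False

Alternatively, re-run the original command with a batch size of 1:

    python -m torch.distributed.launch --nproc_per_node=1 tools/test_net.py --config-file "stand_file/e2e_faster_rcnn_R_50_FPN_1x.yaml" TEST.IMS_PER_BATCH 1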

1 reaction
xiaohai12 commented, Jul 2, 2019

Hi @xiaohai12, how many GPUs did you use when running the evaluation? If you used only one GPU, then is_distributed was passed as False according to the following pieces of code, and shuffling would therefore never be enabled, whatever your TEST.IMS_PER_BATCH was: https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/tools/test_net.py#L50-L51

https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/tools/test_net.py#L96

However, using different values of TEST.IMS_PER_BATCH does result in different mAPs; as explained by @fmassa, that is caused by the padding.
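Paraphrasing the two linked spots in tools/test_net.py (a sketch; the exact code may differ slightly), the distributed flag is derived from the number of launched processes and is what eventually reaches the data loader as is_distributed:

    # Paraphrased sketch of tools/test_net.py:
    num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    distributed = num_gpus > 1
    ...
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)

With --nproc_per_node=1, WORLD_SIZE is 1, so is_distributed is False and the sampler never shuffles.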

For your information, here are my results from running the evaluation of e2e_faster_rcnn_R_50_FPN_1x (the model weights file was downloaded from model id 6358793 in MODEL_ZOO) several times on 2 GPUs:

  1. one image on each GPU (TEST.IMS_PER_BATCH = 2), whether shuffle = True or shuffle = False
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.397
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.483
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.507
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.542
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
    
    AP, AP50, AP75, APs, APm, APl
    0.367747, 0.586073, 0.395546, 0.210593, 0.397355, 0.480963
    
  2. two images on each GPU (TEST.IMS_PER_BATCH = 4) with shuffle = True (by default)
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.483
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.507
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.543
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
    
    AP, AP50, AP75, APs, APm, APl
    0.367857, 0.586037, 0.395842, 0.210618, 0.398223, 0.480817
    
  3. two images on each GPU (TEST.IMS_PER_BATCH = 4) with shuffle = False (manually)
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.368
    Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.586
    Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.396
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.211
    Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
    Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.481
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.307
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.482
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.506
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
    Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.542
    Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
    
    AP, AP50, AP75, APs, APm, APl
    0.367695, 0.586173, 0.395646, 0.210543, 0.397500, 0.481209
    

As you can see, shuffling the images or not (while keeping the same batch size) does result in slightly different mAPs. However, I always get the same results no matter how many times I run the evaluation with shuffle = True on 2 GPUs (the second set of results). This is because a random seed (0) is set before sampling batches, according to the following code: https://github.com/facebookresearch/maskrcnn-benchmark/blob/55796a04ea770029a80cf5933cc5c3f3f6fa59cf/maskrcnn_benchmark/data/samplers/distributed.py#L43-L47
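For context, the linked sampler builds its permutation from a generator seeded with the epoch counter, roughly like this (a paraphrase of the linked lines; during evaluation the epoch is never advanced, so the seed is effectively always 0 and every run sees the same order):

    # Paraphrased sketch of __iter__ in maskrcnn_benchmark/data/samplers/distributed.py:
    if self.shuffle:
        g = torch.Generator()
        g.manual_seed(self.epoch)   # epoch stays at 0 during evaluation
        indices = torch.randperm(len(self.dataset), generator=g).tolist()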

Hi @fmassa, I’m curious why we should shuffle images during testing when using multiple GPUs. It seems that running the evaluation without shuffling also gives the same mAPs (the first set of results, with a batch size of 1 on each GPU).

Thanks for your response. I found that I had added a new transform method and forgot to disable it at test time. It works well now.
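That failure mode is easy to reproduce: if a random augmentation (for example a random horizontal flip) stays in the test-time pipeline, every evaluation pass sees different inputs and the mAP drifts. A generic sketch of keeping augmentations train-only (the build_transforms helper below is illustrative, not the repository's exact API):

    import random
    import torchvision.transforms.functional as TF

    def build_transforms(is_train):
        # Apply random augmentation only while training; keep evaluation deterministic.
        def transform(image):
            if is_train and random.random() < 0.5:
                image = TF.hflip(image)
            return TF.to_tensor(image)
        return transform

    train_transform = build_transforms(is_train=True)
    test_transform = build_transforms(is_train=False)   # no randomness at test time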

Read more comments on GitHub >

Top Results From Across the Web

  • The Model Performance Mismatch Problem (and what to do ...)
    The procedure when evaluating machine learning models is to fit and evaluate them on training data, then verify that the model has good...
  • Evaluating on training data gives different loss - Cross Validated
    When using model.fit and model.evaluate on different datasets, the result will NEVER be exactly the same. There is a multitude of factors, ...
  • Training & evaluation with the built-in methods - Keras
    This guide covers training, evaluation, and prediction (inference) models ... State update and results computation are kept separate (in ...
  • IoU a better detection evaluation metric - Towards Data Science
    Choosing the best model architecture and pretrained weights for your task can be hard. If you've ever worked on an object detection problem...
  • Evaluation Metrics Machine Learning - Analytics Vidhya
    Learn different model evaluation metrics for machine learning like cross validation, confusion matrix, AUC-ROC, RMSE, Gini coefficients and ...
