
Cannot reproduce training performance


Hi Gyeongsik,

I am working on reproducing the numbers reported in the paper.
Train datasets: H36M, MuCo, COCO. Test dataset: 3DPW.

I am using PyTorch 1.8, Python 3.8, and CUDA 10.


I did two runs. Here is the performance of snapshot12.pth (the last checkpoint of the lixel stage) on the 3DPW dataset:

  1. Train batch size per GPU = 16, number of GPUs = 4 (the default config):
     MPJPE from lixel mesh: 96.23 mm
     PA MPJPE from lixel mesh: 60.68 mm
  2. Train batch size per GPU = 24, number of GPUs = 8 (the bigger-batch config):
     MPJPE from lixel mesh: 96.37 mm
     PA MPJPE from lixel mesh: 61.51 mm
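
Note that the two configs above differ in effective batch size (16 × 4 = 64 vs. 24 × 8 = 192 samples per optimizer step). If the learning rate is kept fixed while the batch grows, the linear scaling rule would suggest a proportionally larger rate; here is a minimal sketch of that arithmetic, assuming the default lr of 1e-4 quoted later in this thread:

```python
# Hypothetical check of the linear scaling rule for the two configs above.
base_bs = 16 * 4   # default config: 64 samples per optimizer step
big_bs = 24 * 8    # bigger-batch config: 192 samples per step
base_lr = 1e-4     # assumed default learning rate (see the lr quoted below)

# Linear scaling rule: lr grows in proportion to the effective batch size.
scaled_lr = base_lr * big_bs / base_bs
print(f"effective batches: {base_bs} vs {big_bs}, scaled lr: {scaled_lr:g}")
```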

I also trained the bigger-batch config (run 2) through the param stage. Here is the performance of snapshot17.pth and snapshot15.pth (the best checkpoint) on the 3DPW dataset:

snapshot17.pth, param stage:
  MPJPE from lixel mesh: 95.85 mm
  PA MPJPE from lixel mesh: 61.21 mm
  MPJPE from param mesh: 98.11 mm
  PA MPJPE from param mesh: 61.64 mm

snapshot15.pth, param stage:
  MPJPE from lixel mesh: 95.65 mm
  PA MPJPE from lixel mesh: 60.97 mm
  MPJPE from param mesh: 97.22 mm
  PA MPJPE from param mesh: 60.82 mm
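
For reference, the two metrics quoted throughout this thread follow standard definitions: MPJPE is the mean per-joint Euclidean error (after root alignment), and PA MPJPE is the same error after a similarity (Procrustes) alignment. A minimal NumPy sketch of those definitions, not the evaluation code from this repo:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (mm if inputs are in mm).
    pred, gt: (J, 3) joint coordinates, assumed already root-aligned."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: find the scale, rotation, and
    translation that best map pred onto gt, then measure the residual."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # SVD of the 3x3 covariance
    R = Vt.T @ U.T                      # optimal rotation
    if np.linalg.det(R) < 0:            # fix an improper reflection
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()    # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

In practice both numbers are averaged over all test frames, which is how the per-checkpoint figures above are obtained.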

I am still waiting on the param stage of the default config and will edit this post when it finishes. But the reported lixel MPJPE is 93.2 mm, and it looks unlikely that I will converge to that. Any suggestions? Should I train longer?

Thank you; I would greatly appreciate your help.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
mks0601 commented, Sep 25, 2021

Sorry, I changed common/base.py. It should work now.

0 reactions
Cakin-Kwong commented, Mar 4, 2022

I am working on reproducing the results on 3DPW.
Train datasets: H36M, COCO. Test dataset: 3DPW.
lr_dec_epoch = [10, 12], end_epoch = 13, lr = 1e-4

The performance is as follows and does not reach the numbers in the paper:
MPJPE from lixel mesh: 99.05 mm
PA MPJPE from lixel mesh: 62.68 mm

I wonder whether the training settings stay the same even when more data such as MuCo is used, or whether I should use a different training setting?
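
For readers following along, the schedule quoted above maps onto a standard PyTorch step schedule. A minimal sketch with a placeholder model, assuming a decay factor of 10 (the factor is not stated in the comment):

```python
import torch

# Placeholder model/optimizer; the real ones come from the repo's config.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr = 1e-4

# lr_dec_epoch = [10, 12] -> decay at epochs 10 and 12; assumed gamma = 0.1.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 12], gamma=0.1)

for epoch in range(13):  # end_epoch = 13
    # ... train one epoch here ...
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])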


