
Cannot reproduce training performance


Hi Gyeongsik,

I am working on reproducing the numbers reported in the paper.
Train datasets: H36M, MuCo, COCO. Test dataset: 3DPW.

I am using PyTorch 1.8, Python 3.8, and CUDA 10.


I did two runs. Here is the performance of snapshot12.pth (the last checkpoint of the lixel stage) on the 3DPW dataset:

  1. Train batch size per GPU = 16, number of GPUs = 4 (the default config):
     MPJPE from lixel mesh: 96.23 mm
     PA MPJPE from lixel mesh: 60.68 mm
  2. Train batch size per GPU = 24, number of GPUs = 8 (the bigger-batch config):
     MPJPE from lixel mesh: 96.37 mm
     PA MPJPE from lixel mesh: 61.51 mm
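
Note that the two configs above differ in effective batch size (16 × 4 = 64 vs. 24 × 8 = 192 samples per optimizer step). If the learning rate is kept fixed while the batch grows, the linear scaling rule would suggest a proportionally larger rate; here is a minimal sketch of that arithmetic, assuming the default lr of 1e-4 quoted later in this thread:

```python
# Hypothetical check of the linear scaling rule for the two configs above.
base_bs = 16 * 4   # default config: 64 samples per optimizer step
big_bs = 24 * 8    # bigger-batch config: 192 samples per step
base_lr = 1e-4     # assumed default learning rate (see the lr quoted below)

# Linear scaling rule: lr grows in proportion to the effective batch size.
scaled_lr = base_lr * big_bs / base_bs
print(f"effective batches: {base_bs} vs {big_bs}, scaled lr: {scaled_lr:g}")
```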

I also trained the bigger-batch config (run 2) through the param stage. Here is the performance of snapshot17.pth and snapshot15.pth (the best checkpoint) on the 3DPW dataset:

snapshot17.pth, param stage:
  MPJPE from lixel mesh: 95.85 mm
  PA MPJPE from lixel mesh: 61.21 mm
  MPJPE from param mesh: 98.11 mm
  PA MPJPE from param mesh: 61.64 mm

snapshot15.pth, param stage:
  MPJPE from lixel mesh: 95.65 mm
  PA MPJPE from lixel mesh: 60.97 mm
  MPJPE from param mesh: 97.22 mm
  PA MPJPE from param mesh: 60.82 mm
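
For reference, the two metrics quoted throughout this thread follow standard definitions: MPJPE is the mean per-joint Euclidean error (after root alignment), and PA MPJPE is the same error after a similarity (Procrustes) alignment. A minimal NumPy sketch of those definitions, not the evaluation code from this repo:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (mm if inputs are in mm).
    pred, gt: (J, 3) joint coordinates, assumed already root-aligned."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment: find the scale, rotation, and
    translation that best map pred onto gt, then measure the residual."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)   # SVD of the 3x3 covariance
    R = Vt.T @ U.T                      # optimal rotation
    if np.linalg.det(R) < 0:            # fix an improper reflection
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()    # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return mpjpe(aligned, gt)
```

In practice both numbers are averaged over all test frames, which is how the per-checkpoint figures above are obtained.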

I am still waiting on the param stage of the default config and will edit this post when it finishes. But the reported lixel MPJPE is 93.2 mm, and it looks unlikely that I will converge to that. Any suggestions? Should I train longer?

Thank you; I would greatly appreciate your help.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
mks0601 commented, Sep 25, 2021

Sorry, I changed common/base.py. It should work now.

0 reactions
Cakin-Kwong commented, Mar 4, 2022

I am working on reproducing the results on 3DPW.
Train datasets: H36M, COCO. Test dataset: 3DPW.
lr_dec_epoch = [10, 12], end_epoch = 13, lr = 1e-4

The performance is as follows and does not reach the numbers in the paper:
MPJPE from lixel mesh: 99.05 mm
PA MPJPE from lixel mesh: 62.68 mm

I wonder whether the training settings stay the same even when more data such as MuCo is used, or whether I should use a different training setting?
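
For readers following along, the schedule quoted above maps onto a standard PyTorch step schedule. A minimal sketch with a placeholder model, assuming a decay factor of 10 (the factor is not stated in the comment):

```python
import torch

# Placeholder model/optimizer; the real ones come from the repo's config.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr = 1e-4

# lr_dec_epoch = [10, 12] -> decay at epochs 10 and 12; assumed gamma = 0.1.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 12], gamma=0.1)

for epoch in range(13):  # end_epoch = 13
    # ... train one epoch here ...
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])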


