Compute average training speed and time after training from a (.log.json) file

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help. Yes.
  2. I have read the FAQ documentation but cannot get the expected help. Yes.
  3. The bug has not been fixed in the latest version. Yes, as far as I know.

Describe the bug

The training time is not displayed correctly; see the image below.

I am trying to train CenterNet on my own custom dataset, which has the same format as the COCO dataset. I got the training results as a (.log.json) file and tested images with a chosen (epoch.pth) checkpoint. But when I analyze the training log to find the training time, I get a nan value. Here is the error message:

error average time of traning is nan

Reproduction

  1. What command or script did you run?
python tools/analysis_tools/analyze_logs.py cal_train_time train_results/centernet/20220309_152747.log.json

  2. Did you make any modifications on the code or config? Did you understand what you have modified? Yes. I made the same modifications for other models, for example Faster R-CNN and CentripetalNet, and there it works and I get the training time results.
  3. What dataset did you use? My custom dataset, in the same format as COCO.
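
Before running the analysis, it can help to check how many training time entries the log actually contains per epoch. The following is a minimal sketch, assuming the (.log.json) file holds one JSON object per line and that training records carry `mode`, `epoch` and `time` keys; the helper name `train_times_by_epoch` and the sample records are hypothetical and not taken from the reporter's log.

```python
import json
from collections import defaultdict


def train_times_by_epoch(lines):
    """Group the logged 'time' values of training records by epoch.

    Assumption: each non-empty line is a JSON object, and training records
    look like {"mode": "train", "epoch": ..., "time": ...}.
    """
    times = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip the header line and validation records.
        if record.get("mode") == "train" and "epoch" in record and "time" in record:
            times[record["epoch"]].append(record["time"])
    return dict(times)


# Hypothetical sample standing in for the lines of a real log file.
sample = [
    '{"env_info": "...", "seed": 0}',
    '{"mode": "train", "epoch": 1, "iter": 50, "time": 0.41}',
    '{"mode": "val", "epoch": 1, "iter": 20, "bbox_mAP": 0.1}',
    '{"mode": "train", "epoch": 2, "iter": 50, "time": 0.39}',
]

counts = {epoch: len(t) for epoch, t in train_times_by_epoch(sample).items()}
print(counts)
```

For a real file, the same function can be fed an open file handle, for example `train_times_by_epoch(open(path))`. An epoch with zero or one entries is a candidate for the empty average reported in the traceback.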

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.

TorchVision: 0.8.0a0
OpenCV: 4.3.0
MMCV: 1.3.8
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.0
MMDetection: 2.18.1+393c376

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]: it was already installed on the (GPU) server using Docker files.
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.): I installed seaborn with pip install seaborn, as the system asked me to.

Error traceback

If applicable, paste the error traceback here.

tools/analysis_tools/analyze_logs.py:21: RuntimeWarning: Mean of empty slice.
  epoch_ave_time = all_times.mean(-1)
/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
/opt/conda/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
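
The warnings above come from taking a mean over an axis of length zero. One way this can arise, offered here as an assumption rather than a confirmed diagnosis, is that the script skips the first logged time of each epoch unless outliers are included, and the log holds only a single time entry per epoch. The sketch below reproduces the NumPy behaviour and shows a guarded alternative; `epoch_average_times` and `one_entry_per_epoch` are hypothetical names used only for illustration.

```python
import warnings

import numpy as np


def epoch_average_times(times_per_epoch, include_outliers=False):
    """Average the logged iteration times of each epoch.

    Assumption: unless include_outliers is set, the first logged value of
    each epoch is skipped, since it is often a slow warm-up iteration.
    """
    averages = []
    for times in times_per_epoch:
        kept = list(times) if include_outliers else list(times)[1:]
        # Guard: report nan explicitly instead of averaging an empty list.
        averages.append(float(np.mean(kept)) if kept else float("nan"))
    return averages


# Hypothetical log with a single timing entry per epoch.
one_entry_per_epoch = [[0.41], [0.39]]

# Dropping the first entry leaves nothing, so a plain NumPy mean yields nan.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    trimmed = np.array([t[1:] for t in one_entry_per_epoch])  # shape (2, 0)
    naive = trimmed.mean(-1)

print(naive)
print([str(w.message) for w in caught])
print(epoch_average_times(one_entry_per_epoch))
print(epoch_average_times(one_entry_per_epoch, include_outliers=True))
```

If this is what happens in a given log, keeping the first entry (the script's documentation lists an `--include-outliers` option) or logging more often, so that each epoch has several entries, would avoid the empty slice.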

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15

Top GitHub Comments

1 reaction
jbwang1997 commented, Mar 10, 2022

Yes, I think they are all caused by this reason.

0 reactions
jbwang1997 commented, Mar 12, 2022

The number of GPUs is set on the command line; it is not shown in the config.
