
Tensorboard logging in multi-gpu setting not working properly?


Hi there 😃

I have a question (it may be an issue with the code, or just my ignorance). By the way, I am using the latest version, pytorch-lightning==0.4.9.

If I set up the trainer like this:

trainer = Trainer(experiment=exp, gpus=[0])

I can see the corresponding logging (scalars and hyperparameters) in TensorBoard. If I change it to distributed training, keeping the rest of the code unchanged:

trainer = Trainer(experiment=exp, gpus=[0,1], distributed_backend='ddp')

the TensorBoard logging stops working, at least for scalars and hyperparameters; I see nothing except the experiment name.

In both cases ‘exp’ is an Experiment instantiated like this:

exp = Experiment(save_dir='/SOME/PATH', name=NAME, version=VERSION, description=DESCRIPTION)
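
For completeness, here is a minimal sketch of the full setup. The LightningModule CoolModel is a placeholder of mine, and Experiment is assumed to be test_tube's, which pytorch-lightning 0.4.x used for TensorBoard logging:

from test_tube import Experiment
from pytorch_lightning import Trainer

# CoolModel stands in for whatever LightningModule is being trained
model = CoolModel()

exp = Experiment(save_dir='/SOME/PATH', name=NAME, version=VERSION,
                 description=DESCRIPTION)

# Single-GPU run: scalars and hyperparameters show up in TensorBoard
trainer = Trainer(experiment=exp, gpus=[0])

# Multi-GPU DDP run: the same experiment logs nothing but its name
# trainer = Trainer(experiment=exp, gpus=[0, 1], distributed_backend='ddp')

trainer.fit(model)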

This screenshot illustrates the problem:

[Screenshot: TensorBoard experiment list and scalar chart]

In the picture, the red arrows point to the “distributed” experiment, which draws nothing in the chart. The other two runs (the ones that do appear in the chart) are exactly the same, except that they ran on a single GPU.

Am I missing something, or do I need extra configuration to make logging work in multi-GPU with the DDP backend? Or is it a bug?

Thank you! 😃

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

1 reaction
williamFalcon commented, Sep 18, 2019

OK, submitted a PR. Can you install this version and verify that it works now?

pip install git+https://github.com/williamFalcon/pytorch-lightning.git@fix_tb_logger_rank --upgrade

0 reactions
williamFalcon commented, Sep 19, 2019

@samhumeau we can open a new issue for this. When you init the experiment, it creates a file handle. The solution is to make sure the experiment is initialized from the master GPU only. However, there needs to be a way for this to happen so that the user doesn’t have to think about it.
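
As a hedged illustration of that idea (not the actual fix from the PR; exposing the rank through the LOCAL_RANK environment variable is an assumption and depends on how the DDP processes are spawned), the guard could look something like this:

import os
from test_tube import Experiment

# Hypothetical workaround: only the master (rank-0) process creates the
# Experiment, and with it the TensorBoard file handle; the other ranks
# get None and skip logging. How the rank is exposed varies by launcher
# and version; LOCAL_RANK here is an assumption.
if int(os.environ.get('LOCAL_RANK', '0')) == 0:
    exp = Experiment(save_dir='/SOME/PATH', name=NAME)
else:
    exp = None

Any code that then logs metrics or tags hyperparameters through exp would need to skip the call on the non-master ranks.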
