
In Multi GPU DDP, pytorch-lightning creates several tfevents files

See original GitHub issue

Describe the bug

Right now pytorch-lightning seems to create several tfevents files when training with multi-GPU DDP, e.g. for 2 GPUs:

-rw-rw-r--. 1 sam sam   40 Sep 19 08:11 events.out.tfevents.1568880714.google2-compute82.3156.0
-rw-rw-r--. 1 sam sam 165K Sep 19 08:22 events.out.tfevents.1568880716.google2-compute82.3186.0
-rw-rw-r--. 1 sam sam   40 Sep 19 08:11 events.out.tfevents.1568880718.google2-compute82.3199.0

I suppose the first one is created by the main process and the next 2 are created by the 2 DDP processes (one per GPU). Unfortunately, the actual events are not logged in the last created one, which confuses TensorBoard; cf. https://github.com/tensorflow/tensorboard/issues/1011

I have to restart tensorboard if I want to see the new data.
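
As a sketch of the fix the issue is asking for (the helper name and logdir below are made up, not from the issue): in a plain PyTorch DDP script, only rank 0 would construct the SummaryWriter, so only one tfevents file appears:

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def make_writer(logdir="lightning_logs/manual"):
    # Only rank 0 gets a real writer; the other DDP processes get None,
    # so a single events.out.tfevents.* file is produced.
    rank = dist.get_rank() if dist.is_initialized() else 0
    return SummaryWriter(logdir) if rank == 0 else None

writer = make_writer()
if writer is not None:
    writer.add_scalar("loss", 0.5, global_step=0)
    writer.close()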


To Reproduce

Launch any training on multi-GPU DDP.
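
For concreteness, a minimal script along those lines might look like this (the Boring module is an assumption, not taken from the issue, and the exact Trainer flags vary by version; around the time of this report the spelling was gpus=2, distributed_backend="ddp"):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class Boring(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

train = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
trainer = pl.Trainer(gpus=2, distributed_backend="ddp", max_epochs=1)
trainer.fit(Boring(), train)
# afterwards, lightning_logs/version_0/ holds several events.out.tfevents.* files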

Expected behavior

Only one tfevents file is created, from the master GPU.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 20 (17 by maintainers)

Top GitHub Comments

2 reactions
s-rog commented, Nov 5, 2019

I removed that call and I’m still getting multiple tfevents; no other calls to logging besides the metrics returned by the train and val steps. Currently using the experimental --reload_multifile=true in tensorboard to get around the issue.
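
For reference, that flag lives on the tensorboard command line (tensorboard --logdir lightning_logs --reload_multifile=true); it makes TensorBoard poll every tfevents file in a directory rather than only the newest one. Below is a sketch of setting it when launching TensorBoard from Python, assuming TensorBoard >= 1.15 (where the experimental flag exists) and the default lightning_logs directory:

from tensorboard import program

tb = program.TensorBoard()
# --reload_multifile=true: poll all tfevents files, not just the most recent,
# so runs split across several files show up without restarting TensorBoard
tb.configure(argv=[None, "--logdir", "lightning_logs", "--reload_multifile", "true"])
url = tb.launch()
print(f"TensorBoard listening on {url}")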

1 reaction
awaelchli commented, Jul 4, 2020

If you manually log things, then do this:

if self.trainer.is_global_zero:
    # your custom non-Lightning logging
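
As a concrete (made-up) illustration of where that guard might sit, here is a hypothetical LightningModule that prints from the training step; is_global_zero is True only on rank 0, so the side effect runs once rather than once per GPU:

import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        loss = self.layer(x).sum()
        if self.trainer.is_global_zero:
            # runs on exactly one process, so hand-rolled side effects
            # (prints, files, custom writers) are not duplicated per GPU
            print(f"step {batch_idx}: loss={loss.item():.4f}")
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

Lightning routes its own metric logging through rank zero, so the guard is only needed for logging you perform yourself, which is what the comment above refers to.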
Read more comments on GitHub >

Top Results From Across the Web

In Multi GPU DDP, pytorch-lightning creates several tfevents files
Describe the bug Right now pytorch-lightning seems to create several tfevent files in the multi-gpu ddp way: e.g. for 2 GPUs: -rw-rw-r--.
GPU training (Intermediate) - PyTorch Lightning - Read the Docs
Lightning supports multiple ways of doing distributed training. ... If you request multiple GPUs or nodes without setting a mode, DDP Spawn will...
Multi GPU training with DDP - PyTorch
Constructing the process group. The process group can be initialized by TCP (default) or from a shared file-system. Read more on process group...
Multi-Node Multi-GPU Comprehensive Working Example for ...
This blogpost provides a comprehensive working example of training a PyTorch Lightning model on an AzureML GPU cluster consisting of ...
How to preserve dataset order when using DDP in pytorch ...
I need to be able to preserve the order in which the data is fed to the model when training in multiple GPUS....
