In Multi GPU DDP, pytorch-lightning creates several tfevents files
Describe the bug
Right now pytorch-lightning seems to create several tfevents files when training with multi-GPU DDP, e.g. for 2 GPUs:
-rw-rw-r--. 1 sam sam 40 Sep 19 08:11 events.out.tfevents.1568880714.google2-compute82.3156.0
-rw-rw-r--. 1 sam sam 165K Sep 19 08:22 events.out.tfevents.1568880716.google2-compute82.3186.0
-rw-rw-r--. 1 sam sam 40 Sep 19 08:11 events.out.tfevents.1568880718.google2-compute82.3199.0
I suppose the first one is created by the main process and the other two by the two DDP processes (one per GPU). Unfortunately, the actual events are not logged to the most recently created file, which confuses tensorboard; cf. https://github.com/tensorflow/tensorboard/issues/1011
I have to restart tensorboard if I want to see the new data.
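For context on where these files come from: each process that constructs a TensorBoard writer creates an events file immediately, before anything is logged. Below is a minimal sketch of that mechanism using torch.utils.tensorboard directly (an illustration of the underlying writer behavior, not pytorch-lightning's actual logger code):

```python
from torch.utils.tensorboard import SummaryWriter

# Constructing a SummaryWriter opens a fresh events.out.tfevents.* file
# right away, even if nothing is ever written to it; a header-only file
# is roughly 40 bytes, matching the small files listed above. If every
# DDP process builds its own writer, each one leaves a file behind.
writer = SummaryWriter(log_dir="lightning_logs/version_0")
writer.add_scalar("train/loss", 0.5, 0)  # only a process that logs fills its file
writer.close()
```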
To Reproduce
Launch any training on multi-GPU DDP, as in the sketch below.
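A minimal reproduction might look like the following. This is a sketch, not the reporter's actual script, and Trainer argument names such as gpus and distributed_backend have changed across pytorch-lightning releases, so adjust for your version:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class MinimalModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        return {"loss": loss}

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
        return DataLoader(data, batch_size=8)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# The DDP backend spawns one process per GPU; after fit() returns, the
# log directory contains one tfevents file per spawned process.
trainer = pl.Trainer(gpus=2, distributed_backend="ddp", max_epochs=1)
trainer.fit(MinimalModel())
```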
Expected behavior
Only one tfevents file is created, from the master GPU process.

I removed that call and I'm still getting multiple tfevents files; there are no other logging calls besides the metrics returned by the train and val steps. Currently I'm using the experimental `--reload_multifile=true` flag in tensorboard to get around the issue.

If you manually log things, then do this:
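The code block from the original comment did not survive the page extraction. A plausible sketch of the advice, assuming the intent was to restrict manual logging to the global-rank-zero process (the helper name log_histograms is hypothetical, and the rank_zero_only import path has moved across releases; in recent versions it lives in pytorch_lightning.utilities):

```python
from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def log_histograms(logger, model, step):
    # Runs on the global-rank-0 process only, so just the master
    # process's tfevents file receives the manually logged values.
    for name, param in model.named_parameters():
        logger.experiment.add_histogram(name, param, step)
```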