In Multi GPU DDP, pytorch-lightning creates several tfevents files
Describe the bug
Right now pytorch-lightning seems to create several tfevents files when training with multi-GPU DDP, e.g. for 2 GPUs:
-rw-rw-r--. 1 sam sam 40 Sep 19 08:11 events.out.tfevents.1568880714.google2-compute82.3156.0
-rw-rw-r--. 1 sam sam 165K Sep 19 08:22 events.out.tfevents.1568880716.google2-compute82.3186.0
-rw-rw-r--. 1 sam sam 40 Sep 19 08:11 events.out.tfevents.1568880718.google2-compute82.3199.0
I suppose the first one is created by the main process and the other two by the two DDP processes (one per GPU). Unfortunately, the actual events are not logged to the most recently created file, which confuses tensorboard; cf. https://github.com/tensorflow/tensorboard/issues/1011
I have to restart tensorboard if I want to see the new data.
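For context on where these files come from: each process that constructs a TensorBoard writer creates an events file immediately, before anything is logged. Below is a minimal sketch of that mechanism using torch.utils.tensorboard directly (an illustration of the underlying writer behavior, not pytorch-lightning's actual logger code):

```python
from torch.utils.tensorboard import SummaryWriter

# Constructing a SummaryWriter opens a fresh events.out.tfevents.* file
# right away, even if nothing is ever written to it; a header-only file
# is roughly 40 bytes, matching the small files listed above. If every
# DDP process builds its own writer, each one leaves a file behind.
writer = SummaryWriter(log_dir="lightning_logs/version_0")
writer.add_scalar("train/loss", 0.5, 0)  # only a process that logs fills its file
writer.close()
```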
To Reproduce
Launch any training on multi-GPU DDP, as in the sketch below.
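A minimal reproduction might look like the following. This is a sketch, not the reporter's actual script, and Trainer argument names such as gpus and distributed_backend have changed across pytorch-lightning releases, so adjust for your version:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class MinimalModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        return {"loss": loss}

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 32), torch.randn(64, 1))
        return DataLoader(data, batch_size=8)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# The DDP backend spawns one process per GPU; after fit() returns, the
# log directory contains one tfevents file per spawned process.
trainer = pl.Trainer(gpus=2, distributed_backend="ddp", max_epochs=1)
trainer.fit(MinimalModel())
```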
Expected behavior
Only one tfevents file is created, from the master GPU process.

I removed that call and I'm still getting multiple tfevents files; there are no other logging calls besides the metrics returned by the train and val steps. Currently I'm using the experimental `--reload_multifile=true` flag in tensorboard to get around the issue.

If you manually log things, then do this:
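The code block from the original comment did not survive the page extraction. A plausible sketch of the advice, assuming the intent was to restrict manual logging to the global-rank-zero process (the helper name log_histograms is hypothetical, and the rank_zero_only import path has moved across releases; in recent versions it lives in pytorch_lightning.utilities):

```python
from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def log_histograms(logger, model, step):
    # Runs on the global-rank-0 process only, so just the master
    # process's tfevents file receives the manually logged values.
    for name, param in model.named_parameters():
        logger.experiment.add_histogram(name, param, step)
```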