
Add distributed training

See original GitHub issue
  • Multi-GPU training using DistributedDataParallel (single node for now); a minimal sketch follows after this list
    • PyTorch DataParallel is no longer recommended
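
For orientation only, a minimal single-node DistributedDataParallel loop launched with torchrun might look like the sketch below. This is not the repo's trainer code; the model, data, and hyperparameters are placeholders.

```python
# Minimal single-node DDP sketch (placeholder model and data, not the repo's trainer).
# Launch with: torchrun --nproc_per_node=NUM_GPUS ddp_example.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)          # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(10):                                 # dummy training loop
        x = torch.randn(32, 10).cuda(local_rank)
        y = torch.randn(32, 10).cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                                 # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```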

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11

Top GitHub Comments

3 reactions
joonson commented, Sep 25, 2020

Added in a new branch, distributed. This is the configuration used to produce EER 1.1771 in the released pre-trained model. Note that 8 GPUs were used to train this model, so test_interval and max_epoch must be changed accordingly if you want to use a different number of GPUs.
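
Changing test_interval and max_epoch with the GPU count amounts to rescaling the training schedule. The sketch below is illustrative only: the helper, the placeholder values, and the assumed scaling direction are not taken from the repo, so verify them against the config in the distributed branch.

```python
# Hypothetical helper, not part of the repo: rescale epoch-based settings from
# the released 8-GPU configuration when training on a different GPU count.
# The scaling direction (n_gpus/8 vs 8/n_gpus) depends on how the trainer
# counts steps per epoch under DDP; check the config in the `distributed`
# branch before relying on either.

REFERENCE_GPUS = 8                       # GPUs used for the released model

# Placeholder values only; these are not the released settings.
reference_cfg = {"test_interval": 10, "max_epoch": 500}

def rescale(cfg: dict, factor: float) -> dict:
    """Scale integer schedule settings by `factor`, keeping each >= 1."""
    return {key: max(1, round(value * factor)) for key, value in cfg.items()}

n_gpus = 4
scaled = rescale(reference_cfg, factor=REFERENCE_GPUS / n_gpus)
print(scaled)   # {'test_interval': 20, 'max_epoch': 1000}
```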

0 reactions
lawlict commented, Nov 22, 2020

@joonson Yes, of course. I think the number of training steps is too small.

Also, in this comment there is a separate config from the one mentioned above. Do both of them produce the same results?

Another configuration here takes many more training steps, but I don’t have time to test it now.

Read more comments on GitHub.

Top Results From Across the Web

PyTorch Distributed Overview
Distributed Data-Parallel Training (DDP) is a widely adopted single-program multiple-data training paradigm. With DDP, the model is replicated on every process, ...

Configuring distributed training for PyTorch | AI Platform Training
For distributed PyTorch training, configure your job to use one master worker node and one or more worker nodes. These roles have the...

Distributed training with TensorFlow
Overview. tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs.

PyTorch Distributed Training - Lei Mao's Log Book
In this blog post, I would like to present a simple implementation of PyTorch distributed training on CIFAR-10 classification using ...

Multi node PyTorch Distributed Training Guide For People In A ...
The goal of this tutorial is to give a summary of how to write and launch PyTorch distributed data parallel jobs across multiple...
