
Add distributed training

See original GitHub issue
  • Multi-GPU training using DistributedDataParallel (single node for now); a minimal sketch follows after this list
    • PyTorch DataParallel is no longer recommended
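
For orientation only, a minimal single-node DistributedDataParallel loop launched with torchrun might look like the sketch below. This is not the repo's trainer code; the model, data, and hyperparameters are placeholders.

```python
# Minimal single-node DDP sketch (placeholder model and data, not the repo's trainer).
# Launch with: torchrun --nproc_per_node=NUM_GPUS ddp_example.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)          # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for _ in range(10):                                 # dummy training loop
        x = torch.randn(32, 10).cuda(local_rank)
        y = torch.randn(32, 10).cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                                 # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```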

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11

Top GitHub Comments

3 reactions
joonson commented, Sep 25, 2020

Added in a new branch, distributed. This is the configuration used to produce EER 1.1771 in the released pre-trained model. Note that 8 GPUs were used to train this model, so test_interval and max_epoch must be changed accordingly if you want to use a different number of GPUs.
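
Changing test_interval and max_epoch with the GPU count amounts to rescaling the training schedule. The sketch below is illustrative only: the helper, the placeholder values, and the assumed scaling direction are not taken from the repo, so verify them against the config in the distributed branch.

```python
# Hypothetical helper, not part of the repo: rescale epoch-based settings from
# the released 8-GPU configuration when training on a different GPU count.
# The scaling direction (n_gpus/8 vs 8/n_gpus) depends on how the trainer
# counts steps per epoch under DDP; check the config in the `distributed`
# branch before relying on either.

REFERENCE_GPUS = 8                       # GPUs used for the released model

# Placeholder values only; these are not the released settings.
reference_cfg = {"test_interval": 10, "max_epoch": 500}

def rescale(cfg: dict, factor: float) -> dict:
    """Scale integer schedule settings by `factor`, keeping each >= 1."""
    return {key: max(1, round(value * factor)) for key, value in cfg.items()}

n_gpus = 4
scaled = rescale(reference_cfg, factor=REFERENCE_GPUS / n_gpus)
print(scaled)   # {'test_interval': 20, 'max_epoch': 1000}
```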

0 reactions
lawlict commented, Nov 22, 2020

@joonson Yes, of course. I think the number of training steps is too small.

Also, in this comment there is a separate config from the one mentioned above. Do both of them produce the same results?

Another configuration here takes many more training steps, but I don’t have time to test it now.

Read more comments on GitHub.

Top Results From Across the Web

PyTorch Distributed Overview
Distributed Data-Parallel Training (DDP) is a widely adopted single-program multiple-data training paradigm. With DDP, the model is replicated on every process, ...

Configuring distributed training for PyTorch | AI Platform Training
For distributed PyTorch training, configure your job to use one master worker node and one or more worker nodes. These roles have the...

Distributed training with TensorFlow
Overview. tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs.

PyTorch Distributed Training - Lei Mao's Log Book
In this blog post, I would like to present a simple implementation of PyTorch distributed training on CIFAR-10 classification using ...

Multi node PyTorch Distributed Training Guide For People In A ...
The goal of this tutorial is to give a summary of how to write and launch PyTorch distributed data parallel jobs across multiple...
