
Distributed training

See original GitHub issue

Thanks for open sourcing the code for this awesome paper!

I’m wondering if you used distributed training of the different GAN models during experimentation. If so, could you share an example of how to launch a distributed training job using compare_gan code?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
Marvin182 commented, Sep 18, 2019

Note: We have updated the framework in the meantime and it now supports distributed training (single run on multiple machines) for TPUs.
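The comment above does not show the new launch command, so purely as general context, here is a minimal sketch of what a single distributed run on TPUs looks like with TensorFlow 2.x's tf.distribute.TPUStrategy. This is an illustrative assumption, not the compare_gan entry point; the TPU name and the model are placeholders.

```python
# Generic TF 2.x TPU setup, not the compare_gan code path. The TPU name and
# the model below are placeholders for illustration only.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # placeholder TPU name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():  # variables created here are replicated across the TPU hosts
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(train_dataset)  # each global batch is split across the TPU cores
```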

1 reaction
kkurach commented, Mar 22, 2018

Hi Joppe,

The training of a single GAN is done on a single GPU (it’s relatively fast for the architecture and datasets that we used).

We launched multiple experiments in parallel: first by running compare_gan_generate_tasks to create the set of experiments to run, then by running compare_gan_run_one_task on many machines (machine 0 with task_num=0, machine 1 with task_num=1, etc.).
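A minimal sketch of that two-step workflow, assuming the two binaries named in the comment are installed and on PATH; the --workdir flag and the directory path are illustrative assumptions rather than the exact compare_gan CLI, and only task_num comes from the comment itself.

```python
# Hypothetical sketch of the two-step launch described above. The binary names
# (compare_gan_generate_tasks, compare_gan_run_one_task) and task_num come from
# the comment; the --workdir flag and its path are assumptions for illustration.
import subprocess

WORKDIR = "/tmp/compare_gan_experiments"  # assumed shared experiment directory

# Step 1 (run once): generate the grid of experiments to run.
subprocess.run(
    ["compare_gan_generate_tasks", f"--workdir={WORKDIR}"],
    check=True,
)

# Step 2 (run on machine i): execute the i-th experiment from the grid.
task_num = 0  # 0 on machine 0, 1 on machine 1, and so on
subprocess.run(
    ["compare_gan_run_one_task", f"--workdir={WORKDIR}", f"--task_num={task_num}"],
    check=True,
)
```

In practice each machine runs only step 2 with its own task_num, so the experiments proceed independently and in parallel.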

Read more comments on GitHub >

Top Results From Across the Web

Distributed Training: Guide for Data Scientists - neptune.ai
Precisely, in distributed training, we divide our training workload across multiple processors while training a huge deep learning model. These processors are ...
Distributed Training for Machine Learning - Amazon AWS
Complete distributed training up to 40% faster ... Amazon SageMaker offers the fastest and easiest methods for training large deep learning models and...
Distributed Training - Run:AI
As its name suggests, distributed training distributes training workloads across multiple mini-processors. These mini-processors, referred to as worker nodes, ...
Distributed training with TensorFlow
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can ... (a short usage sketch follows these results).
What is distributed training? | Anyscale
The gist is that distributed training tools spread the training workload within a cluster and on a local workstation with multiple CPUs. Let's ...
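As referenced in the TensorFlow result above, here is a minimal sketch of the tf.distribute.Strategy API for the single-machine, multi-GPU case. It assumes TensorFlow 2.x; the model layers and the commented-out dataset are placeholders of my own, not anything from the issue.

```python
# Minimal tf.distribute.Strategy sketch (TF 2.x). The layer sizes and the
# training dataset are arbitrary placeholders.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicate the model on all local GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored across replicas
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=10)  # each global batch is split across replicas
```

MirroredStrategy handles replication on one machine; the same pattern extends to multiple machines or TPUs by swapping in MultiWorkerMirroredStrategy or TPUStrategy.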
