
Distributed training

See original GitHub issue

Thanks for open sourcing the code for this awesome paper!

I’m wondering if you used distributed training of the different GAN models during experimentation. If so, could you share an example of how to launch a distributed training job using compare_gan code?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
Marvin182 commented, Sep 18, 2019

Note: We have updated the framework in the meantime and it now supports distributed training (single run on multiple machines) for TPUs.
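The comment above does not show the new launch command, so purely as general context, here is a minimal sketch of what a single distributed run on TPUs looks like with TensorFlow 2.x's tf.distribute.TPUStrategy. This is an illustrative assumption, not the compare_gan entry point; the TPU name and the model are placeholders.

```python
# Generic TF 2.x TPU setup, not the compare_gan code path. The TPU name and
# the model below are placeholders for illustration only.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # placeholder TPU name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():  # variables created here are replicated across the TPU hosts
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(train_dataset)  # each global batch is split across the TPU cores
```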

1 reaction
kkurach commented, Mar 22, 2018

Hi Joppe,

The training of a single GAN is done on a single GPU (it’s relatively fast for the architecture and datasets that we used).

We launched multiple experiments in parallel: first by running compare_gan_generate_tasks to create the set of experiments to run, then by running compare_gan_run_one_task on many machines (machine 0 with task_num=0, machine 1 with task_num=1, etc.).
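A minimal sketch of that two-step workflow, assuming the two binaries named in the comment are installed and on PATH; the --workdir flag and the directory path are illustrative assumptions rather than the exact compare_gan CLI, and only task_num comes from the comment itself.

```python
# Hypothetical sketch of the two-step launch described above. The binary names
# (compare_gan_generate_tasks, compare_gan_run_one_task) and task_num come from
# the comment; the --workdir flag and its path are assumptions for illustration.
import subprocess

WORKDIR = "/tmp/compare_gan_experiments"  # assumed shared experiment directory

# Step 1 (run once): generate the grid of experiments to run.
subprocess.run(
    ["compare_gan_generate_tasks", f"--workdir={WORKDIR}"],
    check=True,
)

# Step 2 (run on machine i): execute the i-th experiment from the grid.
task_num = 0  # 0 on machine 0, 1 on machine 1, and so on
subprocess.run(
    ["compare_gan_run_one_task", f"--workdir={WORKDIR}", f"--task_num={task_num}"],
    check=True,
)
```

In practice each machine runs only step 2 with its own task_num, so the experiments proceed independently and in parallel.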

Read more comments on GitHub >

Top Results From Across the Web

Distributed Training: Guide for Data Scientists - neptune.ai
Precisely, in distributed training, we divide our training workload across multiple processors while training a huge deep learning model. These processors are ...
Distributed Training for Machine Learning - Amazon AWS
Complete distributed training up to 40% faster ... Amazon SageMaker offers the fastest and easiest methods for training large deep learning models and...
Distributed Training - Run:AI
As its name suggests, distributed training distributes training workloads across multiple mini-processors. These mini-processors, referred to as worker nodes, ...
Distributed training with TensorFlow
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can ... (a short usage sketch follows these results).
What is distributed training? | Anyscale
The gist is that distributed training tools spread the training workload within a cluster and on a local workstation with multiple CPUs. Let's ...
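As referenced in the TensorFlow result above, here is a minimal sketch of the tf.distribute.Strategy API for the single-machine, multi-GPU case. It assumes TensorFlow 2.x; the model layers and the commented-out dataset are placeholders of my own, not anything from the issue.

```python
# Minimal tf.distribute.Strategy sketch (TF 2.x). The layer sizes and the
# training dataset are arbitrary placeholders.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicate the model on all local GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored across replicas
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=10)  # each global batch is split across replicas
```

MirroredStrategy handles replication on one machine; the same pattern extends to multiple machines or TPUs by swapping in MultiWorkerMirroredStrategy or TPUStrategy.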
