
Distributed training looks broken in Trainer

See original GitHub issue

It looks like distributed training is broken with the Trainer.

The number of GPUs is inferred here: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/trainer.py#L150

However, that helper raises an exception whenever the number of GPUs is greater than 1: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/utils/trainer_utils.py#L13-L18

Distributed training, on the other hand, is only initialized when the number of GPUs is greater than 1, so that code path can never be reached (sketched below): https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/trainer.py#L212-L219
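
Roughly, the two checks interact as follows. This is a paraphrase of the behaviour described above, not a verbatim copy of the linked code; the function names and error message are illustrative.

```python
import torch


def get_num_gpus():
    # Paraphrase of TTS/utils/trainer_utils.py#L13-L18 as described above:
    # the helper refuses to proceed when more than one GPU is visible.
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1:
        raise RuntimeError("Multiple GPUs detected; run distribute.py instead.")
    return num_gpus


def init_distributed(num_gpus):
    # Stand-in for the distributed setup in TTS/trainer.py#L212-L219.
    print(f"initializing distributed training across {num_gpus} GPUs")


# The Trainer only initializes distributed training when num_gpus > 1,
# but get_num_gpus() has already raised in exactly that case,
# so this branch can never be reached.
num_gpus = get_num_gpus()
if num_gpus > 1:
    init_distributed(num_gpus)
```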

Note also that the expected flow is for the user to call distribute.py, which in turn creates one process per GPU, with the visible devices restricted to a single GPU for each process. This means that when attempting multi-GPU training on a single machine, each process ends up running its own single-GPU training. So clearly there needs to be some way of communicating the total number of GPUs to use between the processes.

I would propose the following (a sketch of the resulting flow follows the list):

  • Replace distributed training init method with “env://” (https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) to load from the environment variables MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK
  • Remove config.distributed_url (instead relying on MASTER_ADDR or defaulting to localhost)
  • Let distribute.py use the same environment vars but update RANK for each device (CUDA_VISIBLE_DEVICES should be set as before)
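
Below is a minimal sketch of the proposed flow, assuming the standard torch.distributed “env://” initialization. The launcher is a stand-in for distribute.py; the script name train.py, the default port, and other details are illustrative rather than the actual TTS entry points.

```python
# launcher.py — stand-in for distribute.py under the proposal above.
# Spawns one worker per visible GPU, pins each worker to a single device via
# CUDA_VISIBLE_DEVICES, and passes its RANK through the environment.
import os
import subprocess
import sys

import torch

if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    base_env = dict(os.environ)
    base_env.setdefault("MASTER_ADDR", "localhost")  # replaces config.distributed_url
    base_env.setdefault("MASTER_PORT", "29500")      # illustrative default port
    base_env["WORLD_SIZE"] = str(num_gpus)

    workers = []
    for rank in range(num_gpus):
        env = dict(base_env)
        env["RANK"] = str(rank)
        env["CUDA_VISIBLE_DEVICES"] = str(rank)  # each worker sees exactly one GPU
        workers.append(subprocess.Popen([sys.executable, "train.py"], env=env))
    for worker in workers:
        worker.wait()
```

```python
# train.py — worker side. With init_method="env://", torch.distributed reads
# MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment, so the
# same script also works when a cloud platform populates those variables.
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} of world size {dist.get_world_size()}")
```

Defaulting MASTER_ADDR to localhost covers the single-machine case; anything else comes from the environment, so no URL needs to live in the config.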

This has the added advantage that when using a cloud platform (Vertex AI, Kubeflow, and I suspect most others), the environment variables are populated automatically for multi-machine training, so the target script can be called without any additional work. In that setting the init URL cannot be known ahead of time anyway.

If you agree with this method I’d be happy to implement it.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
erogol commented on Aug 5, 2021

This needs some pondering. Let me put some thought into it and get back to you.

0 reactions
erogol commented on Sep 10, 2021

Fixed in the current version.
