Distributed training looks broken in Trainer
It looks like distributed training is broken with the Trainer.
The number of GPUs is inferred here: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/trainer.py#L150
However, that helper throws an exception if the number of GPUs is > 1: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/utils/trainer_utils.py#L13-L18
Distributed training, on the other hand, is only initialized if the number of GPUs is > 1, so that branch is never reached: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/trainer.py#L212-L219
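In short, the two pieces of logic contradict each other. A rough paraphrase of the two linked snippets (not the actual TTS source, just a sketch of the flow):

```python
import torch


def get_num_gpus():
    # Paraphrase of the trainer_utils check: more than one visible GPU is rejected outright.
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1:
        raise RuntimeError(
            "Expected a single visible GPU; use distribute.py for multi-GPU runs."
        )
    return num_gpus


# Paraphrase of trainer.py: distributed training is only set up when num_gpus > 1,
# but the check above has already raised by then, so this branch is unreachable.
num_gpus = get_num_gpus()
if num_gpus > 1:
    ...  # init_distributed(...) would be called here
```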
Notice also that the expected flow here is that the user calls distribute.py,
which in turn creates one process per GPU, with `CUDA_VISIBLE_DEVICES` restricted to a single GPU for each process. This means that when attempting towered training (multiple GPUs, same machine), each process will run its own single-GPU training. So clearly, there needs to be some way of communicating the total number of GPUs to use between the processes.
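The spawning pattern is roughly the sketch below (a simplification, not the actual distribute.py; the `train.py` script name is a placeholder):

```python
import os
import subprocess
import sys

import torch

# One child process per GPU, each pinned to a single device via
# CUDA_VISIBLE_DEVICES. Nothing here tells a child the total number of GPUs,
# so each child ends up running plain single-GPU training.
processes = []
for gpu_id in range(torch.cuda.device_count()):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    processes.append(subprocess.Popen([sys.executable, "train.py"], env=env))

for proc in processes:
    proc.wait()
```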
I would propose the following:
- Replace the distributed training init method with `env://` (https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) to load `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` from environment variables (see the sketch after this list)
- Remove `config.distributed_url`, instead relying on `MASTER_ADDR` or defaulting to `localhost`
- Let `distribute.py` use the same environment variables but update `RANK` for each device (`CUDA_VISIBLE_DEVICES` should be set as before)
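A minimal sketch of what the proposed `env://` initialization could look like; the port value and the backend choice below are placeholders, not part of the TTS codebase:

```python
import os

import torch.distributed as dist

# With init_method="env://", torch.distributed reads MASTER_ADDR, MASTER_PORT,
# WORLD_SIZE and RANK directly from the environment, so no URL has to be stored
# in the config. On cloud platforms these variables are already populated; for
# local multi-GPU runs, distribute.py would export them per child process.
os.environ.setdefault("MASTER_ADDR", "localhost")  # proposed fallback when unset
os.environ.setdefault("MASTER_PORT", "54321")      # placeholder port
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("RANK", "0")

dist.init_process_group(backend="nccl", init_method="env://")
```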
This has the added advantage that when using a cloud platform (Vertex AI, Kubeflow, and I suspect most others), these environment variables are populated automatically for multi-machine training, so one can call the target script without any additional work. In that case the URL would be impossible to know ahead of time anyway.
If you agree with this method I’d be happy to implement it.
Top GitHub Comments
This needs some pondering. Let me put some thought into it and get back to you.
Fixed in the current version.