Distributed training looks broken in Trainer
It looks like distributed training is broken with the Trainer.
The number of GPUs is inferred here: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/trainer.py#L150
However, that helper throws an exception if the number of GPUs is > 1: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/utils/trainer_utils.py#L13-L18
Distributed training, on the other hand, is only initialized if the number of GPUs is > 1, so that branch is never reached: https://github.com/coqui-ai/TTS/blob/3d614b3ca9dbc365996facca59b594222acc2d9b/TTS/trainer.py#L212-L219
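In short, the two pieces of logic contradict each other. A rough paraphrase of the two linked snippets (not the actual TTS source, just a sketch of the flow):

```python
import torch


def get_num_gpus():
    # Paraphrase of the trainer_utils check: more than one visible GPU is rejected outright.
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1:
        raise RuntimeError(
            "Expected a single visible GPU; use distribute.py for multi-GPU runs."
        )
    return num_gpus


# Paraphrase of trainer.py: distributed training is only set up when num_gpus > 1,
# but the check above has already raised by then, so this branch is unreachable.
num_gpus = get_num_gpus()
if num_gpus > 1:
    ...  # init_distributed(...) would be called here
```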
Notice also that the expected flow here is that the user calls distribute.py,
which in turn creates one process per GPU, with `CUDA_VISIBLE_DEVICES` restricted to a single GPU for each process. This means that when attempting towered training (multiple GPUs, same machine), each process will run its own single-GPU training. So clearly, there needs to be some way of communicating the total number of GPUs to use between the processes.
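The spawning pattern is roughly the sketch below (a simplification, not the actual distribute.py; the `train.py` script name is a placeholder):

```python
import os
import subprocess
import sys

import torch

# One child process per GPU, each pinned to a single device via
# CUDA_VISIBLE_DEVICES. Nothing here tells a child the total number of GPUs,
# so each child ends up running plain single-GPU training.
processes = []
for gpu_id in range(torch.cuda.device_count()):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    processes.append(subprocess.Popen([sys.executable, "train.py"], env=env))

for proc in processes:
    proc.wait()
```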
I would propose the following:
- Replace the distributed training init method with `env://` (https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization) to load `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` from environment variables (see the sketch after this list)
- Remove `config.distributed_url`, instead relying on `MASTER_ADDR` or defaulting to `localhost`
- Let `distribute.py` use the same environment variables but update `RANK` for each device (`CUDA_VISIBLE_DEVICES` should be set as before)
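A minimal sketch of what the proposed `env://` initialization could look like; the port value and the backend choice below are placeholders, not part of the TTS codebase:

```python
import os

import torch.distributed as dist

# With init_method="env://", torch.distributed reads MASTER_ADDR, MASTER_PORT,
# WORLD_SIZE and RANK directly from the environment, so no URL has to be stored
# in the config. On cloud platforms these variables are already populated; for
# local multi-GPU runs, distribute.py would export them per child process.
os.environ.setdefault("MASTER_ADDR", "localhost")  # proposed fallback when unset
os.environ.setdefault("MASTER_PORT", "54321")      # placeholder port
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("RANK", "0")

dist.init_process_group(backend="nccl", init_method="env://")
```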
This has the added advantage that when using a cloud platform (Vertex AI, Kubeflow, and I suspect most others), these environment variables are populated automatically for multi-machine training, so one can call the target script without any additional work. In that case the URL would be impossible to know ahead of time anyway.
If you agree with this method I’d be happy to implement it.
Top GitHub Comments
This needs some pondering. Let me put some thought into it and get back to you.
Fixed in the current version.