Multi GPU Training Error When Using a Training Recipe
Describe the bug
When I run this command on a box with four local GPUs (docker) following the training_a_model doc:
CUDA_VISIBLE_DEVICES="0, 1, 2, 3" python3 -m trainer.distribute --script recipes/ljspeech/fast_speech/train_fast_speech.py
I get this exception:
my_env["CUDA_VISIBLE_DEVICES"] = f"{','.join(gpus)}"
TypeError: sequence item 0: expected str instance, int found
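For what it's worth, the TypeError is reproducible in isolation: str.join() only accepts string items, which is exactly what fails when the GPU ids are ints. A minimal sketch (the variable names are illustrative, not the trainer's actual code):

```python
# str.join() requires every item to be a str; building the GPU list as
# ints (e.g. via range(torch.cuda.device_count())) raises the reported error.
gpus = range(4)
try:
    ",".join(gpus)
except TypeError as exc:
    print(exc)  # sequence item 0: expected str instance, int found

# Converting each item to str first makes the join work:
print(",".join(map(str, gpus)))  # 0,1,2,3
```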
If I run it without the trainer.distribute module:
CUDA_VISIBLE_DEVICES="0,1,2,3" python3 recipes/ljspeech/fast_speech/train_fast_speech.py
the error is
"RuntimeError: [!] 4 active GPUs. Define the target GPU by CUDA_VISIBLE_DEVICES. For multi-gpu training use TTS/bin/distribute.py."
This command (single GPU) works fine:
CUDA_VISIBLE_DEVICES="0" python3 recipes/ljspeech/fast_speech/train_fast_speech.py
To Reproduce
CUDA_VISIBLE_DEVICES="0, 1, 2, 3" python3 -m trainer.distribute --script recipes/ljspeech/fast_speech/train_fast_speech.py
Expected behavior
Training should start on four GPUs.
Logs
No response
Environment
{
"CUDA": {
"GPU": [
"Tesla T4",
"Tesla T4",
"Tesla T4",
"Tesla T4"
],
"available": true,
"version": "10.2"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.11.0+cu102",
"TTS": "0.6.2",
"numpy": "1.21.6"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.7.13",
"version": "#1 SMP Tue Apr 26 20:14:22 UTC 2022"
}
}
Additional context
No response
Issue Analytics
- Created a year ago
- Comments: 9 (7 by maintainers)
Top GitHub Comments
I don't think the space is the cause of the TypeError here. I believe there are a couple of issues:

1. On line 58, a list of int is formatted instead of a list of str, which manifests in the error TypeError: sequence item 0: expected str instance, int found. This could be fixed with a mapping to str: f"{','.join(list(map(str, range(gpus))))}"
2. The approach of using torch.cuda.device_count() with a range() function on lines 29-30 ignores the actual value of the env variable CUDA_VISIBLE_DEVICES. In other words, even if the fix in the first suggestion were implemented and one set CUDA_VISIBLE_DEVICES=0,1, the trainer would still initialize all four GPUs based on range(torch.cuda.device_count()).

So, instead of fixing line 58, I would probably not use range(gpus), but process the env var with a split similar to line 32. The reason line 32 is not working with spaces (--gpus "0, 1") is that each value is not trimmed. Here is a proposed fix with trim to replace lines 28-33:
This is fixed.