Multi GPU Training Error When Using a Training Recipe

See original GitHub issue

Describe the bug

When I run this command on a Docker box with four local GPUs, following the training_a_model doc:

CUDA_VISIBLE_DEVICES="0, 1, 2, 3" python3 -m trainer.distribute --script recipes/ljspeech/fast_speech/train_fast_speech.py

I get this exception:

my_env["CUDA_VISIBLE_DEVICES"] = f"{','.join(gpus)}"
TypeError: sequence item 0: expected str instance, int found

If I run it without the trainer.distribute module:

CUDA_VISIBLE_DEVICES="0,1,2,3" python3 recipes/ljspeech/fast_speech/train_fast_speech.py

the error is:

RuntimeError: [!] 4 active GPUs. Define the target GPU by CUDA_VISIBLE_DEVICES. For multi-gpu training use TTS/bin/distribute.py.

This command (single GPU) works fine:

CUDA_VISIBLE_DEVICES="0" python3 recipes/ljspeech/fast_speech/train_fast_speech.py
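
For illustration (this is not the trainer source), a minimal sketch of why joining a list of ints raises this TypeError, and how converting each item to str avoids it:

gpus = list(range(4))  # [0, 1, 2, 3]: ints, as range(torch.cuda.device_count()) would produce
try:
    ",".join(gpus)  # join() only accepts strings
except TypeError as err:
    print(err)  # sequence item 0: expected str instance, int found
print(",".join(map(str, gpus)))  # prints 0,1,2,3 after mapping each item to str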

To Reproduce

CUDA_VISIBLE_DEVICES="0, 1, 2, 3" python3 -m trainer.distribute --script recipes/ljspeech/fast_speech/train_fast_speech.py

Expected behavior

Training should start on four GPUs.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#1 SMP Tue Apr 26 20:14:22 UTC 2022"
    }
}

Additional context

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

iprovalo commented, May 31, 2022 (2 reactions)

I don’t think the space is the cause of the TypeError here. I believe there are a couple of issues:

  1. Line 58 attempts to join a list of ints instead of strs, which manifests as the error TypeError: sequence item 0: expected str instance, int found. This could be fixed by mapping to str: f"{','.join(list(map(str, range(gpus))))}"

  2. Using torch.cuda.device_count() with range() on lines 29-30 ignores the actual values of the env variable CUDA_VISIBLE_DEVICES. In other words, even if the fix in the first suggestion were implemented and one set CUDA_VISIBLE_DEVICES=0,1, the trainer would still initialize all four GPUs based on range(torch.cuda.device_count()).

So, instead of fixing line 58, I would not use range(gpus); I would process the env var with a split, similar to line 32. The reason line 32 does not work with spaces (--gpus "0, 1") is that each value is not trimmed.

Here is a proposed fix, with trimming, to replace lines 28-33:

if "CUDA_VISIBLE_DEVICES" in os.environ:
    gpus = os.environ['CUDA_VISIBLE_DEVICES']
else:
    gpus = args.gpus

gpus = list(map(str.strip, gpus.split(",")))
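
A quick sanity check of the trimming behavior (plain Python, independent of the trainer):

print("0, 1, 2, 3".split(","))                        # ['0', ' 1', ' 2', ' 3']: note the leading spaces
print(list(map(str.strip, "0, 1, 2, 3".split(","))))  # ['0', '1', '2', '3']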

iprovalo commented, Jul 3, 2022 (0 reactions)

This is fixed.
