Multi GPU Training Error When Using a Training Recipe

See original GitHub issue

Describe the bug

When I run this command on a Docker box with four local GPUs, following the training_a_model doc:

CUDA_VISIBLE_DEVICES="0, 1, 2, 3" python3 -m trainer.distribute --script recipes/ljspeech/fast_speech/train_fast_speech.py

I get this exception:

my_env["CUDA_VISIBLE_DEVICES"] = f"{','.join(gpus)}"
TypeError: sequence item 0: expected str instance, int found

If I run it without the trainer.distribute module:

CUDA_VISIBLE_DEVICES="0,1,2,3" python3 recipes/ljspeech/fast_speech/train_fast_speech.py

the error is:

RuntimeError: [!] 4 active GPUs. Define the target GPU by CUDA_VISIBLE_DEVICES. For multi-gpu training use TTS/bin/distribute.py.

This command (single GPU) works fine:

CUDA_VISIBLE_DEVICES="0" python3 recipes/ljspeech/fast_speech/train_fast_speech.py
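
For illustration (this is not the trainer source), a minimal sketch of why joining a list of ints raises this TypeError, and how converting each item to str avoids it:

gpus = list(range(4))  # [0, 1, 2, 3]: ints, as range(torch.cuda.device_count()) would produce
try:
    ",".join(gpus)  # join() only accepts strings
except TypeError as err:
    print(err)  # sequence item 0: expected str instance, int found
print(",".join(map(str, gpus)))  # prints 0,1,2,3 after mapping each item to str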

To Reproduce

CUDA_VISIBLE_DEVICES="0, 1, 2, 3" python3 -m trainer.distribute --script recipes/ljspeech/fast_speech/train_fast_speech.py

Expected behavior

Training should start on four GPUs.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#1 SMP Tue Apr 26 20:14:22 UTC 2022"
    }
}

Additional context

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

iprovalo commented, May 31, 2022 (2 reactions)

I don’t think the space is the cause of the TypeError here. I believe there are a couple of issues:

  1. Line 58 attempts to join a list of ints instead of strs, which manifests as the error TypeError: sequence item 0: expected str instance, int found. This could be fixed by mapping to str: f"{','.join(list(map(str, range(gpus))))}"

  2. Using torch.cuda.device_count() with range() on lines 29-30 ignores the actual values of the env variable CUDA_VISIBLE_DEVICES. In other words, even if the fix in the first suggestion were implemented and one set CUDA_VISIBLE_DEVICES=0,1, the trainer would still initialize all four GPUs based on range(torch.cuda.device_count()).

So, instead of fixing line 58, I would not use range(gpus); I would process the env var with a split, similar to line 32. The reason line 32 does not work with spaces (--gpus "0, 1") is that each value is not trimmed.

Here is a proposed fix, with trimming, to replace lines 28-33:

if "CUDA_VISIBLE_DEVICES" in os.environ:
    gpus = os.environ['CUDA_VISIBLE_DEVICES']
else:
    gpus = args.gpus

gpus = list(map(str.strip, gpus.split(",")))
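
A quick sanity check of the trimming behavior (plain Python, independent of the trainer):

print("0, 1, 2, 3".split(","))                        # ['0', ' 1', ' 2', ' 3']: note the leading spaces
print(list(map(str.strip, "0, 1, 2, 3".split(","))))  # ['0', '1', '2', '3']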

iprovalo commented, Jul 3, 2022 (0 reactions)

This is fixed.
