
How to specify distributed GPU numbers

Hi,

I want to ask if there is any way to specify which GPUs distributed training runs on.

I used the following command to run my experiment:

CUDA_VISIBLE_DEVICES=1,2,3 python train_speaker_embeddings.py hparams/train_ecapa_tdnn.yaml --data_parallel_backend --data_parallel_count=3

However, it seems it still accessed cuda:0, where I am running my other experiment.
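A quick way to confirm which devices the process actually sees (a minimal sketch, not part of the recipe; the mask must be set before PyTorch initializes CUDA):

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # must happen before the first torch.cuda call

import torch

# The visible devices are renumbered from 0, so inside this process
# cuda:0 refers to physical GPU 1.
print(torch.cuda.device_count())            # expected: 3
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))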

Here is my error message:

speechbrain.core - Parameter is not finite: Parameter containing:
tensor([nan, nan, nan,  ..., nan, nan, nan], device='cuda:0',
       requires_grad=True)
speechbrain.core - Parameter is not finite: Parameter containing:
tensor([[-2.6658e-02,  8.7483e-03, -3.4362e-02,  ..., -1.9040e-02,
          2.2166e-02, -3.0037e-02],
        [-6.8921e-04, -7.1734e-05,  1.5596e-04,  ...,  1.7386e-06,
          5.4680e-06, -8.7531e-05],
        [ 7.4638e-04, -5.4538e-05,  1.0013e-05,  ...,  1.7373e-06,
          6.9947e-06, -1.4809e-04],
        ...,
        [-1.0935e-03, -7.2480e-05, -3.2643e-05,  ..., -1.0808e-05,
          6.3613e-05, -1.1533e-03],
        [ 3.3602e-04, -4.2033e-05,  8.1406e-06,  ...,  1.3459e-06,
          2.7885e-06, -4.5925e-06],
        [-1.0819e-06, -5.2945e-07,  1.1754e-07,  ...,  1.6078e-08,
          3.3266e-08, -2.5519e-07]], device='cuda:0', requires_grad=True)
 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ                     | 30180/38732 [05:25<41:28,  3.44it/s, train_loss=0.198]
speechbrain.core - Exception:
Traceback (most recent call last):
  File "train_speaker_embeddings.py", line 249, in <module>
    speaker_brain.fit(
  File "/mnt/md2/user_winston/speechbrain/speechbrain/core.py", line 1013, in fit
    loss = self.fit_batch(batch)
  File "/mnt/md2/user_winston/speechbrain/speechbrain/core.py", line 842, in fit_batch
    if self.check_gradients(loss):
  File "/mnt/md2/user_winston/speechbrain/speechbrain/core.py", line 875, in check_gradients
    raise ValueError(
ValueError: Loss is not finite and patience is exhausted. To debug, wrap fit() with autograd's `detect_anomaly()`, e.g.

with torch.autograd.detect_anomaly():
        brain.fit(...)
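Applied to this recipe, the suggested wrapper would look roughly like this (a sketch; the fit() arguments stand for whatever train_speaker_embeddings.py already passes at line 249):

import torch

# detect_anomaly() makes autograd raise at the exact backward op that
# produced the first NaN/inf, instead of failing later in
# check_gradients(). It slows training, so enable it only to debug.
with torch.autograd.detect_anomaly():
    speaker_brain.fit(...)  # placeholder for the original arguments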

Recipe: speechbrain/recipes/VoxCeleb/SpeakerRe
Python: 3.8

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
jamfly commented, Apr 29, 2021

Hi @aheba, the GPUs used are 0,1,2,3 (all of my GPUs) instead of 1,2,3.

0 reactions
aheba commented, Jun 16, 2021

@jamfly done
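
For anyone landing here with the same symptom: once CUDA_VISIBLE_DEVICES=1,2,3 is exported, PyTorch renumbers the visible devices, so the device='cuda:0' in the error above is physical GPU 1, not GPU 0. If nvidia-smi shows all four GPUs busy, the mask was most likely not in effect when CUDA initialized. Below is a sketch of pinning DataParallel to the three visible devices, with a stand-in model rather than the recipe's code:

import torch
import torch.nn as nn

# Stand-in for the ECAPA-TDNN model; any nn.Module works the same way.
model = nn.Linear(512, 192).to("cuda:0")  # cuda:0 = first *visible* GPU

# DataParallel defaults to every visible device; device_ids pins the
# replica set explicitly (0, 1, 2 are the renumbered visible GPUs).
model = nn.DataParallel(model, device_ids=[0, 1, 2])

x = torch.randn(8, 512, device="cuda:0")
y = model(x)  # the batch of 8 is split across the three visible GPUs
print(y.shape)  # torch.Size([8, 192])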

Top Results From Across the Web

Multi-GPU and distributed training - Keras
In this setup, you have one machine with several GPUs on it (typically 2 to 8). Each device will run a copy of...

Distributed GPU training guide (SDK v2) - Azure
Use the distribution parameter of the command to specify settings for MpiDistribution.

Why and How to Use Multiple GPUs for Distributed Training
When one GPU isn't enough for deep learning, we know that we should probably use more. But why, and how?

How to scale training on multiple GPUs - Towards Data Science
Line 2–6: We instantiate the model and set it to run on the specified GPU, and run our operations on multiple GPUs in...

Efficient Training on Multiple GPUs - Hugging Face
Switching from a single GPU to multiple GPUs requires some form of parallelism, as the work needs to be distributed. There are several techniques...
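
Several of the links above converge on the same advice: for multi-GPU training, prefer one process per GPU with DistributedDataParallel over single-process DataParallel. A minimal, self-contained PyTorch sketch follows (illustrative model and settings, not taken from any of the linked posts):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # one process per visible GPU

    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(4, 10, device=rank)
    loss = model(x).sum()
    loss.backward()  # gradients are all-reduced across processes
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Combine with CUDA_VISIBLE_DEVICES=1,2,3 to restrict which
    # physical GPUs the ranks land on.
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)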
