How to specify distributed GPU numbers
Hi,
I want to ask whether there is any way to specify which GPUs are used for distributed training.
I ran my experiment with the following command:
CUDA_VISIBLE_DEVICES=1,2,3 python train_speaker_embeddings.py hparams/train_ecapa_tdnn.yaml --data_parallel_backend --data_parallel_count=3
However, it still seems to access cuda:0, where I am running another experiment (a small visibility check is sketched below, after the traceback).
Here is my error message:
speechbrain.core - Parameter is not finite: Parameter containing:
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', requires_grad=True)
speechbrain.core - Parameter is not finite: Parameter containing:
tensor([[-2.6658e-02, 8.7483e-03, -3.4362e-02, ..., -1.9040e-02,
2.2166e-02, -3.0037e-02],
[-6.8921e-04, -7.1734e-05, 1.5596e-04, ..., 1.7386e-06,
5.4680e-06, -8.7531e-05],
[ 7.4638e-04, -5.4538e-05, 1.0013e-05, ..., 1.7373e-06,
6.9947e-06, -1.4809e-04],
...,
[-1.0935e-03, -7.2480e-05, -3.2643e-05, ..., -1.0808e-05,
6.3613e-05, -1.1533e-03],
[ 3.3602e-04, -4.2033e-05, 8.1406e-06, ..., 1.3459e-06,
2.7885e-06, -4.5925e-06],
[-1.0819e-06, -5.2945e-07, 1.1754e-07, ..., 1.6078e-08,
3.3266e-08, -2.5519e-07]], device='cuda:0', requires_grad=True)
78%|████████████████████████████████████████████████████████████████████████████ | 30180/38732 [05:25<41:28, 3.44it/s, train_loss=0.198]
speechbrain.core - Exception:
Traceback (most recent call last):
File "train_speaker_embeddings.py", line 249, in <module>
speaker_brain.fit(
File "/mnt/md2/user_winston/speechbrain/speechbrain/core.py", line 1013, in fit
loss = self.fit_batch(batch)
File "/mnt/md2/user_winston/speechbrain/speechbrain/core.py", line 842, in fit_batch
if self.check_gradients(loss):
File "/mnt/md2/user_winston/speechbrain/speechbrain/core.py", line 875, in check_gradients
raise ValueError(
ValueError: Loss is not finite and patience is exhausted. To debug, wrap fit() with autograd's `detect_anomaly()`, e.g.
with torch.autograd.detect_anomaly():
brain.fit(...)
Recipe: speechbrain/recipes/VoxCeleb/SpeakerRe
Python: 3.8
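For reference, here is a minimal check (not part of the original report; the file name is hypothetical) of which physical GPUs the process actually sees when CUDA_VISIBLE_DEVICES is set. Note that PyTorch renumbers the visible devices from zero, so with CUDA_VISIBLE_DEVICES=1,2,3 the device reported as cuda:0 inside the process corresponds to physical GPU 1:

# check_visible_gpus.py -- hypothetical helper, not part of the SpeechBrain recipe.
# Run it exactly as the training script is launched, e.g.:
#   CUDA_VISIBLE_DEVICES=1,2,3 python check_visible_gpus.py
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())

# PyTorch renumbers the visible devices starting from cuda:0, so with
# CUDA_VISIBLE_DEVICES=1,2,3 the index cuda:0 maps to physical GPU 1.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))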
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @aheba, the GPUs used are 0, 1, 2, 3 (all of my GPUs) instead of 1, 2, 3.
@jamfly done
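A general workaround, not stated in the thread and offered only as a sketch: restrict the visible devices inside the training script itself, before anything initializes CUDA, so the data-parallel backend can only ever see the intended GPUs. This assumes the snippet runs before any import that touches CUDA.

# Hypothetical snippet; would go at the very top of train_speaker_embeddings.py,
# before "import torch" or any SpeechBrain import that initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # expose only physical GPUs 1, 2 and 3

import torch  # imported afterwards, so it only sees the three devices above
print("visible devices:", torch.cuda.device_count())  # expected to print 3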