Single machine with multiple GPUs raises "found at least two devices" error
See original GitHub issue
I am running on a single machine with multiple GPUs and training raises the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 300, in distributed_main
main(args, init_distributed=True)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 87, in main
train(args, trainer, task, epoch_itr)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 130, in train
log_output = trainer.train_step(samples)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq/trainer.py", line 400, in train_step
assert all(norm == prev_norms[0] for norm in prev_norms) or all(
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq/trainer.py", line 400, in <genexpr>
assert all(norm == prev_norms[0] for norm in prev_norms) or all(
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/torch/tensor.py", line 27, in wrapped
return f(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
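The failure comes from the gradient-norm consistency check in fairseq's trainer: the norms gathered from the worker processes are CUDA tensors that still live on their original devices (cuda:0, cuda:1, ...), and on PyTorch 1.7 comparing tensors on different devices raises this RuntimeError (on older PyTorch the norms were plain floats, which is likely why the downgrade mentioned in the comments below helps). A minimal sketch of the failing pattern, assuming at least two visible GPUs:

```python
import torch

# Minimal reproduction of the failing comparison (assumes at least two visible GPUs).
# fairseq 0.9.0 gathers each worker's previous gradient norm and asserts they are all
# equal; with PyTorch 1.7 those norms arrive as CUDA tensors still bound to their
# original devices, so the equality check itself is a cross-device operation.
a = torch.tensor(1.0, device="cuda:0")
b = torch.tensor(1.0, device="cuda:1")

try:
    print(a == b)  # RuntimeError: Expected all tensors to be on the same device ...
except RuntimeError as err:
    print(err)

# Moving both operands onto one device, or reducing them to Python floats first,
# makes the comparison well-defined again:
print(a == b.to(a.device))   # tensor(True, device='cuda:0')
print(float(a) == float(b))  # True
```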
Below is my shell script:
> GPU_ID=0,1,2,3
> DATA_BIN_DIR="output/bin"
> OUT_DIR="output/models"
> BATCH_SIZE=128
> MAX_TOKENS=10000
> SEED=1
>
> CUDA_VISIBLE_DEVICES="${GPU_ID}" python3 -m fairseq_cli.train \
> ${DATA_BIN_DIR} \
> --save-dir ${OUT_DIR} \
> -a fconv \
> --num-workers=4 --skip-invalid-size-inputs-valid-test \
> --encoder-embed-dim 300 \
> --decoder-embed-dim 300 \
> --decoder-out-embed-dim 500 \
> --encoder-layers '[(1024,3)] * 5' --decoder-layers '[(1024,3)] * 5' \
> --dropout='0.2' --clip-norm=0.1 \
> --optimizer nag --momentum 0.99 \
> --lr-scheduler=reduce_lr_on_plateau --lr=0.25 --lr-shrink=0.1 --min-lr=1e-4 \
> --max-epoch 100 \
> --batch-size ${BATCH_SIZE} \
> --max-tokens ${MAX_TOKENS} \
> --seed ${SEED}
>
Am I missing anything?
Python version: 3.7.6
fairseq version: 0.9.0
CUDA version: 11.0
PyTorch version: 1.7.1
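As a quick sanity check with this setup (an illustrative snippet, not from the original report), the following just lists the devices PyTorch can see under the same CUDA_VISIBLE_DEVICES value used by the training script:

```python
import os
import torch

# Illustrative check, not from the original issue: list the devices PyTorch sees
# under the same CUDA_VISIBLE_DEVICES value used by the training script.
# The variable must be set before the first CUDA call in the process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```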
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 7
Top Results From Across the Web

How can I solve this pytorch two devices error - Stack Overflow
I ran into a problem with PyTorch: Expected all tensors to be on the same device, but found at least two devices, cpu...

CUDA On Multiple Devices Error Breaking Stable Diffusion ...
I get this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!...

Error resuming from checkpoint with multiple GPUs #11435
I started training a model on two GPUs, using the following trainer: trainer = pl.Trainer( devices = [0,2], accelerator='gpu', precision=16, max_epochs=2000 ...

nn.DataParallel: RuntimeError: Expected all tensors to be on ...
I am trying to parallelise my model to train it on 2 GPUs. ... all tensors to be on the same device, but...

“TensorFlow with multiple GPUs” - Jonathan Hui blog
If a TensorFlow operation has both CPU and GPU implementations, TensorFlow will automatically place the operation to run on a GPU device first....
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think the problem might be the PyTorch version. I met the same problem; downgrading PyTorch to 1.4.0 solved it.
Re-opening after 1 year. @lazerliu, were you able to train on torch >1.6.0?
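For anyone who cannot downgrade, a device-agnostic version of the check that fails in the traceback above would look roughly like this (an illustrative sketch based on fairseq 0.9.0's trainer.py, not an official fix; the NaN/Inf clause is reproduced from that version and the exact code may differ):

```python
import math
import torch

def norms_consistent(prev_norms):
    """Device-agnostic variant of the gradient-norm check from the traceback.

    fairseq 0.9.0 asserts that the norms gathered from all workers are equal.
    When those norms are CUDA tensors pinned to different devices, the raw
    `norm == prev_norms[0]` comparison raises the RuntimeError; reducing each
    norm to a plain Python float first sidesteps the cross-device comparison.
    """
    norms = [n.item() if torch.is_tensor(n) else float(n) for n in prev_norms]
    return all(n == norms[0] for n in norms) or all(
        math.isnan(n) or math.isinf(n) for n in norms
    )

# Example: norms reported by workers on different GPUs no longer break the check.
if torch.cuda.device_count() >= 2:
    print(norms_consistent([torch.tensor(3.5, device="cuda:0"),
                            torch.tensor(3.5, device="cuda:1")]))  # True
```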