Single machine with multiple GPUs raises "found at least two devices" error

See original GitHub issue

I am running on a single machine with multiple GPUs, and training raises the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 300, in distributed_main
    main(args, init_distributed=True)
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 87, in main
    train(args, trainer, task, epoch_itr)
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 130, in train
    log_output = trainer.train_step(samples)
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq/trainer.py", line 400, in train_step
    assert all(norm == prev_norms[0] for norm in prev_norms) or all(
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq/trainer.py", line 400, in <genexpr>
    assert all(norm == prev_norms[0] for norm in prev_norms) or all(
  File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/torch/tensor.py", line 27, in wrapped
    return f(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
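For context: the assertion at fairseq/trainer.py line 400 compares gradient norms gathered from every worker, and on newer PyTorch those norms come back as CUDA tensors that each live on their own GPU, so the `==` comparison itself triggers the device mismatch. A minimal, hypothetical illustration of that failure mode (assumes at least two visible GPUs; not part of the original report):

```python
import torch

# Standalone illustration (hypothetical): comparing tensors that live on
# different GPUs raises the same RuntimeError the fairseq assertion hits.
assert torch.cuda.device_count() >= 2, "needs at least two visible GPUs"

norm_rank0 = torch.tensor(1.0, device="cuda:0")
norm_rank1 = torch.tensor(1.0, device="cuda:1")

try:
    norm_rank0 == norm_rank1  # cross-device comparison
except RuntimeError as err:
    print(err)  # "Expected all tensors to be on the same device, ..."

# Converting to host-side floats (or moving both to one device) avoids it:
print(float(norm_rank0) == float(norm_rank1))  # True
```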

Below is my shell script

> GPU_ID=0,1,2,3
> DATA_BIN_DIR="output/bin"
> OUT_DIR="output/models"
> BATCH_SIZE=128
> MAX_TOKENS=10000
> SEED=1
> 
> CUDA_VISIBLE_DEVICES="${GPU_ID}" python3 -m fairseq_cli.train \
>     ${DATA_BIN_DIR} \
>     --save-dir ${OUT_DIR} \
>     -a fconv \
>     --num-workers=4 --skip-invalid-size-inputs-valid-test \
>     --encoder-embed-dim 300 \
>     --decoder-embed-dim 300 \
>     --decoder-out-embed-dim 500 \
>     --encoder-layers '[(1024,3)] * 5' --decoder-layers '[(1024,3)] * 5' \
>     --dropout='0.2' --clip-norm=0.1 \
>     --optimizer nag --momentum 0.99 \
>     --lr-scheduler=reduce_lr_on_plateau --lr=0.25 --lr-shrink=0.1 --min-lr=1e-4 \
>     --max-epoch 100 \
>     --batch-size ${BATCH_SIZE} \
>     --max-tokens ${MAX_TOKENS} \
>     --seed ${SEED}
>     

Am I missing anything?

Python version: 3.7.6, fairseq version: 0.9.0, CUDA version: 11.0, PyTorch version: 1.7.1

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
Hexa4C commented, Apr 9, 2021

I think the problem might be the PyTorch version. I ran into the same problem; downgrading PyTorch to 1.4.0 solved it.
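If downgrading is not an option, another possibility, offered here only as an untested sketch and not as an official fairseq fix, is to make the per-worker norms plain Python floats before the assertion shown in the traceback, so the consistency check no longer compares tensors that live on different GPUs. The helper below is hypothetical and only mirrors the visible part of that check:

```python
import torch

# Hypothetical workaround sketch (untested), not an official fairseq fix:
# convert the per-worker gradient norms to plain Python floats before the
# consistency assertion in fairseq/trainer.py (line 400 in the traceback),
# so the == comparison never runs between CUDA tensors on different devices.
def norms_consistent(prev_norms):
    prev_norms = [float(n) for n in prev_norms]  # device-agnostic scalars
    return all(n == prev_norms[0] for n in prev_norms)

# The norms in the traceback sit on cuda:0 and cuda:1; with the float()
# conversion the check succeeds regardless of which device each lives on.
if torch.cuda.device_count() >= 2:
    gathered = [torch.tensor(0.5, device="cuda:0"),
                torch.tensor(0.5, device="cuda:1")]
    print(norms_consistent(gathered))  # True
```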

0 reactions
ayushbits commented, Sep 16, 2022

Re-opening after 1 year. @lazerliu, were you able to train on torch > 1.6.0?

Read more comments on GitHub >

Top Results From Across the Web

How can I solve this pytorch two devices error - Stack Overflow
I ran into a problem with PyTorch: Expected all tensors to be on the same device, but found at least two devices, cpu...
Read more >
CUDA On Multiple Devices Error Breaking Stable Diffusion ...
I get this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!...
Read more >
Error resuming from checkpoint with multiple GPUs #11435
I started training a model on two GPUs, using the following trainer: trainer = pl.Trainer( devices = [0,2], accelerator='gpu', precision=16, max_epochs=2000 ...
Read more >
nn.DataParallel: RuntimeError: Expected all tensors to be on ...
I am trying to parallelise my model to train it on 2 GPUs. ... all tensors to be on the same device, but...
Read more >
“TensorFlow with multiple GPUs” - Jonathan Hui blog
If a TensorFlow operation has both CPU and GPU implementations, TensorFlow will automatically place the operation to run on a GPU device first....
Read more >
