Single machine with multiple GPUs raises "found at least two devices" error
See original GitHub issue
I am running on a single machine with multiple GPUs and training raises the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 300, in distributed_main
main(args, init_distributed=True)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 87, in main
train(args, trainer, task, epoch_itr)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq_cli/train.py", line 130, in train
log_output = trainer.train_step(samples)
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq/trainer.py", line 400, in train_step
assert all(norm == prev_norms[0] for norm in prev_norms) or all(
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/fairseq/trainer.py", line 400, in <genexpr>
assert all(norm == prev_norms[0] for norm in prev_norms) or all(
File "/home/w29758143/anaconda3/envs/pycorrector/lib/python3.7/site-packages/torch/tensor.py", line 27, in wrapped
return f(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
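The failure comes from the gradient-norm consistency check in fairseq's trainer: the norms gathered from the worker processes are CUDA tensors that still live on their original devices (cuda:0, cuda:1, ...), and on PyTorch 1.7 comparing tensors on different devices raises this RuntimeError (on older PyTorch the norms were plain floats, which is likely why the downgrade mentioned in the comments below helps). A minimal sketch of the failing pattern, assuming at least two visible GPUs:

```python
import torch

# Minimal reproduction of the failing comparison (assumes at least two visible GPUs).
# fairseq 0.9.0 gathers each worker's previous gradient norm and asserts they are all
# equal; with PyTorch 1.7 those norms arrive as CUDA tensors still bound to their
# original devices, so the equality check itself is a cross-device operation.
a = torch.tensor(1.0, device="cuda:0")
b = torch.tensor(1.0, device="cuda:1")

try:
    print(a == b)  # RuntimeError: Expected all tensors to be on the same device ...
except RuntimeError as err:
    print(err)

# Moving both operands onto one device, or reducing them to Python floats first,
# makes the comparison well-defined again:
print(a == b.to(a.device))   # tensor(True, device='cuda:0')
print(float(a) == float(b))  # True
```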
Below is my shell script:
> GPU_ID=0,1,2,3
> DATA_BIN_DIR="output/bin"
> OUT_DIR="output/models"
> BATCH_SIZE=128
> MAX_TOKENS=10000
> SEED=1
>
> CUDA_VISIBLE_DEVICES="${GPU_ID}" python3 -m fairseq_cli.train \
> ${DATA_BIN_DIR} \
> --save-dir ${OUT_DIR} \
> -a fconv \
> --num-workers=4 --skip-invalid-size-inputs-valid-test \
> --encoder-embed-dim 300 \
> --decoder-embed-dim 300 \
> --decoder-out-embed-dim 500 \
> --encoder-layers '[(1024,3)] * 5' --decoder-layers '[(1024,3)] * 5' \
> --dropout='0.2' --clip-norm=0.1 \
> --optimizer nag --momentum 0.99 \
> --lr-scheduler=reduce_lr_on_plateau --lr=0.25 --lr-shrink=0.1 --min-lr=1e-4 \
> --max-epoch 100 \
> --batch-size ${BATCH_SIZE} \
> --max-tokens ${MAX_TOKENS} \
> --seed ${SEED}
>
Am I missing anything?
Python version: 3.7.6
fairseq version: 0.9.0
CUDA version: 11.0
PyTorch version: 1.7.1
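As a quick sanity check with this setup (an illustrative snippet, not from the original report), the following just lists the devices PyTorch can see under the same CUDA_VISIBLE_DEVICES value used by the training script:

```python
import os
import torch

# Illustrative check, not from the original issue: list the devices PyTorch sees
# under the same CUDA_VISIBLE_DEVICES value used by the training script.
# The variable must be set before the first CUDA call in the process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

print("visible devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```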
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 7
Top Results From Across the Web

How can I solve this pytorch two devices error - Stack Overflow
I ran into a problem with PyTorch: Expected all tensors to be on the same device, but found at least two devices, cpu...

CUDA On Multiple Devices Error Breaking Stable Diffusion ...
I get this error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!...

Error resuming from checkpoint with multiple GPUs #11435
I started training a model on two GPUs, using the following trainer: trainer = pl.Trainer( devices = [0,2], accelerator='gpu', precision=16, max_epochs=2000 ...

nn.DataParallel: RuntimeError: Expected all tensors to be on ...
I am trying to parallelise my model to train it on 2 GPUs. ... all tensors to be on the same device, but...

“TensorFlow with multiple GPUs” - Jonathan Hui blog
If a TensorFlow operation has both CPU and GPU implementations, TensorFlow will automatically place the operation to run on a GPU device first....
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think the problem might be the PyTorch version. I met the same problem; downgrading PyTorch to 1.4.0 solved it.
Re-opening after 1 year. @lazerliu, were you able to train on torch >1.6.0?
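For anyone who cannot downgrade, a device-agnostic version of the check that fails in the traceback above would look roughly like this (an illustrative sketch based on fairseq 0.9.0's trainer.py, not an official fix; the NaN/Inf clause is reproduced from that version and the exact code may differ):

```python
import math
import torch

def norms_consistent(prev_norms):
    """Device-agnostic variant of the gradient-norm check from the traceback.

    fairseq 0.9.0 asserts that the norms gathered from all workers are equal.
    When those norms are CUDA tensors pinned to different devices, the raw
    `norm == prev_norms[0]` comparison raises the RuntimeError; reducing each
    norm to a plain Python float first sidesteps the cross-device comparison.
    """
    norms = [n.item() if torch.is_tensor(n) else float(n) for n in prev_norms]
    return all(n == norms[0] for n in norms) or all(
        math.isnan(n) or math.isinf(n) for n in norms
    )

# Example: norms reported by workers on different GPUs no longer break the check.
if torch.cuda.device_count() >= 2:
    print(norms_consistent([torch.tensor(3.5, device="cuda:0"),
                            torch.tensor(3.5, device="cuda:1")]))  # True
```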