Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

AssertionError with multiple GPU

See original GitHub issue

System Info

  • OS: Red Hat Server 7.7
  • PyTorch: 1.6.0
  • Transformers: 3.0.2
  • Python: 3.7.6
  • Number of GPUs: 4

Question

I am trying to fine-tune a GPT2 model using Trainer on a machine with multiple GPUs installed. However, I get the following error:

Traceback (most recent call last):
  File "run_finetune_gpt2.py", line 158, in <module>
    main()
  File "run_finetune_gpt2.py", line 145, in main
    trainer.train()
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: You can sync this run to the cloud by running:
wandb: wandb sync wandb/dryrun-20200914_134757-1sih3p0q

Any ideas about what might be going on? Thanks in advance!
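For context on where the assertion comes from: nn.DataParallel scatters each batch across the GPUs, runs a model replica on each, and then gathers the replica outputs back onto one device; Gather.apply asserts that every gathered output is a CUDA tensor. The sketch below is not the original run_finetune_gpt2.py (the model and shapes are made up for illustration), but assuming at least two visible CUDA devices, a forward pass that round-trips its output through numpy produces a CPU tensor and trips exactly this assertion.

# Hypothetical minimal reproduction, assuming >= 2 visible CUDA devices.
# DataParallel's gather asserts every replica output .is_cuda, so a numpy
# round trip inside forward() triggers the AssertionError from the traceback.
import torch
import torch.nn as nn

class NumpyRoundTrip(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)

    def forward(self, x):
        out = self.linear(x)
        # Converting to numpy detaches from autograd and moves the data to the
        # CPU, so what gather receives is no longer a CUDA tensor.
        return torch.from_numpy(out.detach().cpu().numpy())

model = nn.DataParallel(NumpyRoundTrip().cuda())
x = torch.randn(4, 8, device="cuda")
model(x)  # raises AssertionError in Gather.apply, as in the traceback above

With a single GPU the gather step is skipped entirely, which is why the same code can appear to run fine on one device and only fail once DataParallel has more than one replica to merge.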

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Sep 22, 2020

There was no error because the tensors were set back on the only GPU you had when they came back from numpy, but the gradients were still wrong (basically everything that happened before the numpy part was wiped out).
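A hedged sketch of the fix this comment points at, assuming the problem was a numpy round trip somewhere in the forward/loss path (the module and ops below are illustrative, not taken from the issue): keep everything in torch operations so the outputs stay on each replica's GPU and the autograd graph is preserved.

import torch
import torch.nn as nn

class TorchOnly(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 1)

    def forward(self, x):
        out = self.linear(x)
        # Post-process with torch ops instead of numpy: the result stays on the
        # replica's device and keeps its autograd history, so DataParallel's
        # gather succeeds and the gradients are no longer wiped out.
        return torch.sigmoid(out)

model = nn.DataParallel(TorchOnly().cuda())
loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()  # gradients flow back through every replica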

0 reactions
stale[bot] commented, Nov 24, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Read more comments on GitHub >

Top Results From Across the Web

AssertionError: The number of GPUs ([1]) must be the same as ...
Yeah, I tried with GPU_INDEX = 0 for 1 GPU it started the training. but for 2 GPU, it throws : AssertionError: The...
Read more >
Multi GPU GRU AssertionError - PyTorch Forums
I wrap my module with DataParallel. I use device_ids=[2, 3] when forward, the gru part will raise AssertionError class Multi(nn.
Read more >
Multi-GPU Training - YOLOv5 Documentation
This guide explains how to properly use multiple GPUs to train a dataset with YOLOv5 on single or multiple machine(s). UPDATED 25 September...
Read more >
How to Fix "AssertionError: CUDA unavailable, invalid device ...
1- After download NVIDIA Driver: Go to your window and search for "NVIDIA Control Panel"; Then at the bottom left there should be...
Read more >
Topaz Extract assertion error - Particle Picking
Hi all, I have been getting an “AssertionError: No particles ... 9, 10, 11] [CPU: 68.5 MB] GPU : [2] [CPU: 68.5 MB]...
Read more >
