
Training on Single GPU

See original GitHub issue

Thanks for the exciting work.

I am trying to fine-tune on my ImageNet-like classification dataset on a single GPU using the following command.

python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --model xcit_nano_12_p16 --batch-size 16 --drop-path 0.05 --output_dir experiments/xcit_nano_12_p16/ --epochs 30 --pretrained /mnt/hdd1/Projects/XCiT/xcit_nano_12_p16_224.pth

But it fails with the following error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 1, 128]], which is output 0 of SliceBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

What can be done to resolve this? I am new to distributed training.
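
Following the hint in the traceback, anomaly detection can be enabled before the training loop so that the backward error also reports which forward-pass operation produced the failing gradient. This is standard PyTorch API; model and images below are placeholder names, and the snippet is a sketch, not part of the XCiT code:

    import torch

    # Enable anomaly detection before the training loop. The backward error
    # will then include a second traceback pointing at the forward-pass
    # operation that produced the bad gradient. This slows training, so
    # enable it only while debugging.
    torch.autograd.set_detect_anomaly(True)

    # Or scope it to a single step with the context manager:
    # with torch.autograd.detect_anomaly():
    #     loss = model(images).sum()
    #     loss.backward()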

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
dwhite54 commented, Aug 6, 2021

I found the culprit (I was way off before). A workaround is to set tokens_norm=True (here for example). Going by the comments, this will hurt your performance if you’re just doing inference with a pretrained xcit_nano.
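
For context, here is a minimal self-contained sketch of why that flag matters, under the assumption (suggested by the [16, 1, 128] shape in the traceback, i.e. the CLS-token slice for batch 16 at embed dim 128) that the error comes from an in-place write into a tensor slice whose original value autograd still needs; tokens_norm=True normalizes all tokens instead of assigning into the slice. Tensor names here are illustrative, not taken from the XCiT source:

    import torch

    norm = torch.nn.LayerNorm(128)

    # Illustrative shapes: batch 16, 197 tokens (1 CLS + 196 patches), dim 128.
    x = torch.randn(16, 197, 128, requires_grad=True)
    x = x * 1.0  # make x a non-leaf tensor so the in-place write is permitted

    # LayerNorm saves its input (a view of x) for backward; the slice
    # assignment then modifies x in place, invalidating that saved view.
    x[:, 0:1] = norm(x[:, 0:1])
    try:
        x.sum().backward()
    except RuntimeError as e:
        print(e)  # "... modified by an inplace operation ... SliceBackward ..."

    # Out-of-place equivalent: normalize the CLS token and rebuild the tensor,
    # which keeps autograd's saved tensors intact.
    y = torch.randn(16, 197, 128, requires_grad=True)
    y = torch.cat([norm(y[:, 0:1]), y[:, 1:]], dim=1)
    y.sum().backward()  # succeeds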

1 reaction
trathpai commented, Aug 12, 2021

The solution by @dwhite54 above works, so closing.


Top Results From Across the Web

Efficient Training on a Single GPU - Hugging Face
This guide focuses on training large models efficiently on a single GPU. These approaches are still valid if you have access to a...
Single Node, Single GPU Training - Flyte
Training a model on a single node on one GPU is as trivial as writing any Flyte task and simply setting the GPU...
6-3 Model Training Using Single GPU
6-3 Model Training Using Single GPU. The training procedure of deep learning is usually time consuming. It even takes tens of days for...
Embedding Training With 1% GPU Memory and 100 Times ...
Hybrid Training: This method starts by splitting the embedding table into two parts, one trained on the GPU and the other trained on...
Train 18-billion-parameter GPT models with a single GPU on ...
For the representative of large models — GPT, Colossal-AI is capable of training it with up to 1.5 billion parameters on a gaming...
