
Training on Single GPU

See original GitHub issue

Thanks for the exciting work.

I am trying to fine-tune on my ImageNet-like classification dataset on a single GPU using the following command.

python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --model xcit_nano_12_p16 --batch-size 16 --drop-path 0.05 --output_dir experiments/xcit_nano_12_p16/ --epochs 30 --pretrained /mnt/hdd1/Projects/XCiT/xcit_nano_12_p16_224.pth

But it fails with the following error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 1, 128]], which is output 0 of SliceBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

What can be done to resolve this? I am new to distributed training.
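
Following the hint in the traceback, anomaly detection can be enabled before the training loop so that the backward error also reports which forward-pass operation produced the failing gradient. This is standard PyTorch API; model and images below are placeholder names, and the snippet is a sketch, not part of the XCiT code:

    import torch

    # Enable anomaly detection before the training loop. The backward error
    # will then include a second traceback pointing at the forward-pass
    # operation that produced the bad gradient. This slows training, so
    # enable it only while debugging.
    torch.autograd.set_detect_anomaly(True)

    # Or scope it to a single step with the context manager:
    # with torch.autograd.detect_anomaly():
    #     loss = model(images).sum()
    #     loss.backward()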

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
dwhite54 commented, Aug 6, 2021

I found the culprit (I was way off before). A workaround is to set tokens_norm=True (here for example). Going by the comments, this will hurt your performance if you’re just doing inference with a pretrained xcit_nano.
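
For context, here is a minimal self-contained sketch of why that flag matters, under the assumption (suggested by the [16, 1, 128] shape in the traceback, i.e. the CLS-token slice for batch 16 at embed dim 128) that the error comes from an in-place write into a tensor slice whose original value autograd still needs; tokens_norm=True normalizes all tokens instead of assigning into the slice. Tensor names here are illustrative, not taken from the XCiT source:

    import torch

    norm = torch.nn.LayerNorm(128)

    # Illustrative shapes: batch 16, 197 tokens (1 CLS + 196 patches), dim 128.
    x = torch.randn(16, 197, 128, requires_grad=True)
    x = x * 1.0  # make x a non-leaf tensor so the in-place write is permitted

    # LayerNorm saves its input (a view of x) for backward; the slice
    # assignment then modifies x in place, invalidating that saved view.
    x[:, 0:1] = norm(x[:, 0:1])
    try:
        x.sum().backward()
    except RuntimeError as e:
        print(e)  # "... modified by an inplace operation ... SliceBackward ..."

    # Out-of-place equivalent: normalize the CLS token and rebuild the tensor,
    # which keeps autograd's saved tensors intact.
    y = torch.randn(16, 197, 128, requires_grad=True)
    y = torch.cat([norm(y[:, 0:1]), y[:, 1:]], dim=1)
    y.sum().backward()  # succeeds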

1 reaction
trathpai commented, Aug 12, 2021

The solution by @dwhite54 above works, so closing.


Top Results From Across the Web

Efficient Training on a Single GPU - Hugging Face
This guide focuses on training large models efficiently on a single GPU. These approaches are still valid if you have access to a...
Single Node, Single GPU Training - Flyte
Training a model on a single node on one GPU is as trivial as writing any Flyte task and simply setting the GPU...
6-3 Model Training Using Single GPU
6-3 Model Training Using Single GPU. The training procedure of deep learning is usually time consuming. It even takes tens of days for...
Embedding Training With 1% GPU Memory and 100 Times ...
Hybrid Training: This method starts by splitting the embedding table into two parts, one trained on the GPU and the other trained on...
Train 18-billion-parameter GPT models with a single GPU on ...
For the representative of large models — GPT, Colossal-AI is capable of training it with up to 1.5 billion parameters on a gaming...
