
Training slows to a halt after iteration 5000

See original GitHub issue

I’ve observed, in both the multi-GPU and single-GPU settings, that after iteration 5000 the training seems to slow to a halt.

CentOS, PyTorch 1.7, training on 8 RTX 6000s with the command:

stylegan2_pytorch --data data --image_size 512 --name name --multi_gpus --num_workers 0 --batch_size 40 --aug-prob 0.25 --attn-layers [1,2] --gradient-accumulate-every 1

nvidia-smi shows full GPU utilization, and I can see that some CPUs are still active. Has anyone experienced something like this, and if so, do you know the cause of this behaviour?
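
If you drive the training loop yourself rather than through the stylegan2_pytorch CLI, a quick way to pin down exactly where the per-iteration time jumps is to time each step. This is only a minimal sketch; dataloader and train_one_step are hypothetical placeholders, not names from the library.

import time

def timed(iterable, report_every=100):
    # Yield items unchanged while printing the average seconds per step,
    # so a sudden jump (e.g. right after step 5000) stands out.
    start = time.perf_counter()
    for i, item in enumerate(iterable, 1):
        yield item
        if i % report_every == 0:
            elapsed = time.perf_counter() - start
            print(f"steps {i - report_every + 1}-{i}: {elapsed / report_every:.3f} s/step")
            start = time.perf_counter()

# Usage (placeholders): for batch in timed(dataloader): train_one_step(batch)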

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
canadaduane commented, Dec 19, 2020

I just had a weird occurrence: at around 39,000 iterations, progress stopped. By all indications it seemed like it was still “working on something”, but it couldn’t get past its current iteration (it usually takes about 5 seconds per iteration on my hardware, but it had been stuck there for about half an hour). So I hit Ctrl-C in the Anaconda window, thinking to restart it. Instead, the Ctrl-C seems to have nudged it back into action. I’m not sure how or why, but in any case it is crunching again.
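
For what it’s worth, a general way to see where a running Python training process is spending its time, without interrupting it the way Ctrl-C does, is to register a stack-dumping signal handler at startup. This is just the standard library’s faulthandler module, nothing specific to stylegan2_pytorch.

import faulthandler
import signal

# Register once near startup. Afterwards, running `kill -USR1 <pid>` from
# another shell makes the process print every thread's Python stack trace
# to stderr without stopping it, which shows exactly where it is blocked.
faulthandler.register(signal.SIGUSR1, all_threads=True)

For a process that is already hung and was started without this, an external sampler such as py-spy (py-spy dump --pid <pid>) can produce a similar stack dump.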

0 reactions
chiwing4 commented, Feb 9, 2021

Adding --no_pl_reg True fixed the problem, thank you.

I did a quick search and found that the path penalty does indeed kick in after step 5000: apply_path_penalty = not self.no_pl_reg and self.steps > 5000 and self.steps % 32 == 0. The problem is, will turning off the path penalty have a negative effect on the training result…? No matter what, I’m turning it off because the training speed is too slow.
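
To make the quoted gate concrete, here is a minimal sketch (not the library’s actual training loop) of how it schedules the extra work: once the step count passes 5000, every 32nd step also runs the path-length penalty, which requires extra gradient computation through the generator and is therefore noticeably more expensive than a plain step. do_expensive_pl_step is a hypothetical stand-in.

def apply_path_penalty(step, no_pl_reg=False, start=5000, every=32):
    # Same condition as the quoted line, with the constants made explicit.
    return (not no_pl_reg) and step > start and step % every == 0

def do_expensive_pl_step():
    pass  # placeholder for the costly path-length-penalty computation

for step in (4999, 5000, 5024, 5056, 6000, 6016):
    if apply_path_penalty(step):
        do_expensive_pl_step()
    print(step, apply_path_penalty(step))  # True only for 5024, 5056, 6016

With --no_pl_reg True the condition is always False, which is consistent with the fix reported above.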


Top Results From Across the Web

  • Tensorflow predict/train step has massive slowdown after
    The training starts well, but after many iterations of the above (~1.5k over about 5 hours), the training suddenly grinds to a halt...
  • Training gets slow down by each batch slowly - PyTorch Forums
    As for generating training data on-the-fly, the speed is very fast at beginning but significantly slow down after a few iterations (3000). At ...
  • Why does neural network learning slow down as the error gets ...
    The reasons for the slowdown are not fully understood, but we have some basic ideas. For classifiers, most training examples start out as...
  • Troubleshooting Slow Running Flows - Microsoft Support
    If you have 'Do until' or 'For each item' loops in your flow, see if you can reduce the number of loop iterations,...
