Training slows to a halt after iteration 5000
I’ve observed, in both the multi-GPU and single-GPU settings, that after iteration 5000 training seems to slow to a halt.
CentOS, PyTorch 1.7, training on 8 RTX 6000s with the command
stylegan2_pytorch --data data --image_size 512 --name name --multi_gpus --num_workers 0 --batch_size 40 --aug-prob 0.25 --attn-layers [1,2] --gradient-accumulate-every 1
nvidia-smi shows full GPU utilization, and I can see that some CPUs are still active. Has anyone experienced something like this and if so do you know the cause of this behaviour?
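As a first diagnostic, a minimal sketch (not part of stylegan2_pytorch; train_one_step is a hypothetical stand-in for a single training iteration) that logs wall-clock time per block of iterations, so an abrupt slowdown at a particular step shows up clearly:

import time

def timed_training_loop(train_one_step, num_steps, log_every=100):
    # Print the average seconds per iteration for each block of log_every steps,
    # so a sudden jump (e.g. shortly after step 5000) stands out in the log.
    last = time.perf_counter()
    for step in range(1, num_steps + 1):
        train_one_step(step)
        if step % log_every == 0:
            now = time.perf_counter()
            print(f"step {step}: {(now - last) / log_every:.3f} s/iter")
            last = now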
Issue Analytics
- Created 3 years ago
- Comments: 8
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I just had a weird occurrence: at around 39000 iterations, progress stopped. By all indications it seemed like it was still “working on something”, but it couldn’t get past its current iteration (it usually takes about 5 seconds per iteration on my hardware, but it had been stuck there for about half an hour). So I hit Ctrl-C in the Anaconda window, thinking to restart it. Instead, the Ctrl-C seems to have nudged it back into action. I’m not sure how or why, but in any case it is crunching again.
Adding
--no_pl_reg True
fixed the problem, thank you. I did a quick search and found that the path length penalty does indeed kick in after step 5000:
apply_path_penalty = not self.no_pl_reg and self.steps > 5000 and self.steps % 32 == 0
The problem is, will turning off the path penalty have a negative effect on the training results…? No matter what, I’m turning it off because the training speed is too slow.
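For context on why that gate matters: the line quoted above only applies the penalty after step 5000 and then every 32 steps, and a StyleGAN2-style path length penalty needs an extra gradient pass through the generator (with create_graph=True) on top of the normal backward pass, which is where the added cost comes from. Below is a minimal sketch of such a penalty, not the repository’s exact code; the tiny generator, the shapes, and the use of the batch mean instead of an exponential moving average are placeholders for illustration only.

import math
import torch
from torch import autograd, nn

# Stand-in generator and style codes, just to make the sketch self-contained.
gen = nn.Sequential(nn.Linear(64, 3 * 32 * 32))
styles = torch.randn(8, 64, requires_grad=True)
images = gen(styles).view(8, 3, 32, 32)

# Perturb the images with noise scaled by the pixel count, then measure how
# strongly that perturbation maps back onto the style codes.
num_pixels = images.shape[2] * images.shape[3]
pl_noise = torch.randn_like(images) / math.sqrt(num_pixels)
outputs = (images * pl_noise).sum()

# create_graph=True builds a second graph so the penalty itself can be
# backpropagated -- this extra pass is the main source of the added cost.
pl_grads = autograd.grad(outputs, styles, create_graph=True)[0]
pl_lengths = pl_grads.pow(2).sum(dim=1).sqrt()

# StyleGAN2 penalizes deviation from an exponential moving average of the
# path lengths; the batch mean is used here only to keep the sketch short.
pl_penalty = (pl_lengths - pl_lengths.mean()).pow(2).mean()
pl_penalty.backward()

Disabling the regularizer removes that extra pass, which is consistent with the speedup reported above; what it costs in sample quality is the open question raised in the last comment.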