
Training slows to a halt after iteration 5000

See original GitHub issue

I’ve observed, in both the multi-GPU and single-GPU settings, that after iteration 5000 the training seems to slow to a halt.

CentOS, PyTorch 1.7, training on 8 RTX 6000s with the command:

stylegan2_pytorch --data data --image_size 512 --name name --multi_gpus --num_workers 0 --batch_size 40 --aug-prob 0.25 --attn-layers [1,2] --gradient-accumulate-every 1

nvidia-smi shows full GPU utilization, and I can see that some CPUs are still active. Has anyone experienced something like this, and if so, do you know the cause of this behaviour?
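
If you drive the training loop yourself rather than through the stylegan2_pytorch CLI, a quick way to pin down exactly where the per-iteration time jumps is to time each step. This is only a minimal sketch; dataloader and train_one_step are hypothetical placeholders, not names from the library.

import time

def timed(iterable, report_every=100):
    # Yield items unchanged while printing the average seconds per step,
    # so a sudden jump (e.g. right after step 5000) stands out.
    start = time.perf_counter()
    for i, item in enumerate(iterable, 1):
        yield item
        if i % report_every == 0:
            elapsed = time.perf_counter() - start
            print(f"steps {i - report_every + 1}-{i}: {elapsed / report_every:.3f} s/step")
            start = time.perf_counter()

# Usage (placeholders): for batch in timed(dataloader): train_one_step(batch)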

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
canadaduane commented, Dec 19, 2020

I just had a weird occurrence: at around 39,000 iterations, progress stopped. By all indications it seemed like it was still “working on something”, but it couldn’t get past its current iteration (it usually takes about 5 seconds per iteration on my hardware, but it had been stuck there for about half an hour). So I hit Ctrl-C in the Anaconda window, thinking to restart it. Instead, the Ctrl-C seems to have nudged it back into action. I’m not sure how or why, but in any case it is crunching again.
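
For what it’s worth, a general way to see where a running Python training process is spending its time, without interrupting it the way Ctrl-C does, is to register a stack-dumping signal handler at startup. This is just the standard library’s faulthandler module, nothing specific to stylegan2_pytorch.

import faulthandler
import signal

# Register once near startup. Afterwards, running `kill -USR1 <pid>` from
# another shell makes the process print every thread's Python stack trace
# to stderr without stopping it, which shows exactly where it is blocked.
faulthandler.register(signal.SIGUSR1, all_threads=True)

For a process that is already hung and was started without this, an external sampler such as py-spy (py-spy dump --pid <pid>) can produce a similar stack dump.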

0 reactions
chiwing4 commented, Feb 9, 2021

Adding --no_pl_reg True fixed the problem, thank you.

I did a quick search and found that the path penalty does indeed kick in after step 5000: apply_path_penalty = not self.no_pl_reg and self.steps > 5000 and self.steps % 32 == 0. The problem is, will turning off the path penalty have a negative effect on the training result…? No matter what, I’m turning it off because the training speed is too slow.
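
To make the quoted gate concrete, here is a minimal sketch (not the library’s actual training loop) of how it schedules the extra work: once the step count passes 5000, every 32nd step also runs the path-length penalty, which requires extra gradient computation through the generator and is therefore noticeably more expensive than a plain step. do_expensive_pl_step is a hypothetical stand-in.

def apply_path_penalty(step, no_pl_reg=False, start=5000, every=32):
    # Same condition as the quoted line, with the constants made explicit.
    return (not no_pl_reg) and step > start and step % every == 0

def do_expensive_pl_step():
    pass  # placeholder for the costly path-length-penalty computation

for step in (4999, 5000, 5024, 5056, 6000, 6016):
    if apply_path_penalty(step):
        do_expensive_pl_step()
    print(step, apply_path_penalty(step))  # True only for 5024, 5056, 6016

With --no_pl_reg True the condition is always False, which is consistent with the fix reported above.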


Top Results From Across the Web

  • Tensorflow predict/train step has massive slowdown after
    The training starts well, but after many iterations of the above (~1.5k over about 5 hours), the training suddenly grinds to a halt...
  • Training gets slow down by each batch slowly - PyTorch Forums
    As for generating training data on-the-fly, the speed is very fast at beginning but significantly slow down after a few iterations (3000). At ...
  • Why does neural network learning slow down as the error gets ...
    The reasons for the slowdown are not fully understood, but we have some basic ideas. For classifiers, most training examples start out as...
  • Troubleshooting Slow Running Flows - Microsoft Support
    If you have 'Do until' or 'For each item' loops in your flow, see if you can reduce the number of loop iterations,...
