Training with train_bert_ds.py does not converge.

I ran the HelloDeepSpeed example. The following experiment converges normally, and I can see the loss drop.

python train_bert.py --checkpoint_dir ./experiments --local_rank 0

However, train_bert_ds.py does not converge. The loss stays at 10.9**.

deepspeed train_bert_ds.py --checkpoint_dir ./ds_exp

Why?
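
A quick sanity check for this kind of symptom: the cross-entropy of a model that predicts every token with equal probability is ln(V), where V is the vocabulary size. The sketch below assumes a vocabulary of about 50,000 tokens, which the issue does not state, and uses 10.9 as a stand-in for the reported value whose trailing digits are elided. The helper names and the tolerance are hypothetical, chosen only for illustration. If the stuck loss sits close to ln(V), the model is probably not learning at all rather than learning slowly.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy in nats of a uniform prediction over vocab_size tokens."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    return math.log(vocab_size)


def looks_untrained(loss: float, vocab_size: int, tolerance: float = 0.2) -> bool:
    """True if the loss is within `tolerance` nats of the uniform baseline.

    The tolerance is an arbitrary illustrative threshold, not a standard value.
    """
    return abs(loss - uniform_cross_entropy(vocab_size)) <= tolerance


# ASSUMPTION: the vocabulary size is not given in the issue; 50,000 is a guess.
ASSUMED_VOCAB_SIZE = 50_000
# ASSUMPTION: 10.9 stands in for the reported "10.9**".
REPORTED_LOSS = 10.9

baseline = uniform_cross_entropy(ASSUMED_VOCAB_SIZE)
print(f"uniform baseline: {baseline:.2f} nats")
print(f"loss near baseline: {looks_untrained(REPORTED_LOSS, ASSUMED_VOCAB_SIZE)}")
```

Under these assumptions the baseline is about 10.82 nats, so a loss pinned near 10.9 would be consistent with weights that are never effectively updated. That points the investigation at the optimizer, learning rate, or precision settings of the DeepSpeed run instead of the data pipeline.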

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
MihaiBalint commented, Jan 6, 2022

@tjruwase With the PR applied, the BERT example converges as expected.

0 reactions
tjruwase commented, Jan 6, 2022

@MihaiBalint, awesome! Thanks for the quick confirmation.


