
The batch-sizes of single machine commands are not adjusted

See original GitHub issue

In the training doc, I believe we need to adjust the batch size (or the LR) in the single-machine commands to keep the total batch size the same.

For example, currently the ConvNeXt-S reports:

  • Multi-node: --nodes 4 --ngpus 8 --batch_size 128 --lr 4e-3
  • Single-machine: --nproc_per_node=8 --batch_size 128 --lr 4e-3 (I believe this should be --batch_size 512)

The same applies to the other variants; a quick arithmetic check is sketched below.
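
For concreteness, here is a small sanity check of the arithmetic. This is a sketch only; the effective_batch_size helper is hypothetical, while the flag names and numbers come from the ConvNeXt-S example above.

```python
# Effective (global) batch size per optimizer step, without gradient accumulation.

def effective_batch_size(nodes: int, gpus_per_node: int, per_gpu_batch: int) -> int:
    """Global batch size the optimizer sees per step."""
    return nodes * gpus_per_node * per_gpu_batch

multi_node = effective_batch_size(nodes=4, gpus_per_node=8, per_gpu_batch=128)      # 4096
single_machine = effective_batch_size(nodes=1, gpus_per_node=8, per_gpu_batch=128)  # 1024

# To match the multi-node recipe on one 8-GPU machine, solve for the per-GPU batch:
matched_per_gpu = multi_node // (1 * 8)  # 512, i.e. --batch_size 512
```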

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions
liuzhuang13 commented on Jan 24, 2022

We inherited this gradient-accumulation feature from the BEiT codebase. In the coming week we will be busy with some other paper-related work, so I'm not sure I can contribute this on short notice. If it is still relevant or needed after a week, I'm happy to contribute. For me the main thing to work out would be the process; the code part should be simple enough.

BTW, the single-machine conversion you mentioned (using --batch_size 512) seems very likely to OOM on a typical machine for a rather large model…
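
For readers unfamiliar with the technique, here is a minimal PyTorch sketch of gradient accumulation as a way to reach a large effective batch size without OOM. It illustrates the general idea only and is not the BEiT/ConvNeXt implementation; model, optimizer, loader, and accum_steps are assumptions.

```python
import torch
import torch.nn.functional as F

# Run `accum_steps` forward/backward passes before each optimizer step, so the
# effective batch size becomes per_gpu_batch * num_gpus * accum_steps
# (e.g. 128 * 8 * 4 == 4096 on a single 8-GPU machine).

def train_one_epoch(model, optimizer, loader, accum_steps: int = 4):
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        loss = F.cross_entropy(model(images), targets)
        (loss / accum_steps).backward()  # average the loss over the accumulation window
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```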

1 reaction
liuzhuang13 commented on Mar 15, 2022

Hi @anonymoussss,

It may affect reproduction results, since each batch size has a different optimal learning rate. It is common practice to scale the LR in proportion to the batch size, meaning you could use 3e-3 (instead of 4e-3 for 4096) as the learning rate if your effective batch size is 3072.
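
The linear scaling rule from this comment is easy to write down. A minimal sketch, taking the recipe's reference point (lr 4e-3 at effective batch size 4096) as given; the scaled_lr helper itself is hypothetical:

```python
# Linear LR scaling: the learning rate scales in proportion to the
# effective batch size, relative to a known (base_lr, base_batch) pair.

def scaled_lr(base_lr: float, base_batch: int, effective_batch: int) -> float:
    return base_lr * effective_batch / base_batch

print(scaled_lr(4e-3, 4096, 3072))  # 3e-3, as in the comment above
print(scaled_lr(4e-3, 4096, 1024))  # 1e-3, for the unadjusted single-machine run
```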

Read more comments on GitHub >

Top Results From Across the Web

How does batch size affect convergence of SGD and why?
The size of mini-batches is essentially the frequency of updates: the smaller minibatches the more updates. At one extreme (minibatch=dataset) ...
Read more >
How to change the batch size during training? - Stack Overflow
For others who land here, I found the easiest way to do batch size adjustment in Keras is just to call fit more...
Read more >
Batch mode not functioning as expected #8597 - saltstack/salt
Observed behavior: Batch mode only executes commands on batch-size minions at once per timeout interval on 0.17.0 and later.
Read more >
Epoch vs Batch Size vs Iterations - Towards Data Science
We need terminologies like epochs, batch size, iterations only when the data is too big which happens all the time in machine learning...
Read more >
Optimizing Distributed and Parallel Training
Adjusting global_batch_size can affect your model convergence, which can affect your training and/or testing accuracy. You may need to adjust model ...
Read more >
