The batch-sizes of single machine commands are not adjusted
See original GitHub issue
On the training doc, I believe we need to adjust the batch size (or the LR) in the single-machine commands so that the total batch size stays the same.
For example, the ConvNeXt-S section currently reports:
- Multi-node:
--nodes 4 --ngpus 8 --batch_size 128 --lr 4e-3
- Single-machine:
--nproc_per_node=8 --batch_size 128 --lr 4e-3
<- I believe here it should be --batch_size 512
The same applies to the other variants (a quick arithmetic check is sketched below).
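For reference, here is a minimal arithmetic sketch of the total batch size implied by the commands above. This is illustrative only, not code from the ConvNeXt repo; the update_freq factor is an assumption based on the gradient-accumulation feature discussed in the comments below.

```python
def effective_batch_size(nodes, ngpus, batch_size, update_freq=1):
    # Total batch size = processes (nodes * GPUs per node) * per-GPU batch size,
    # multiplied by any gradient-accumulation steps.
    return nodes * ngpus * batch_size * update_freq

# Multi-node command from the doc: 4 nodes x 8 GPUs x 128 per GPU = 4096
assert effective_batch_size(nodes=4, ngpus=8, batch_size=128) == 4096

# Single-machine command as written: 1 node x 8 GPUs x 128 per GPU = 1024
assert effective_batch_size(nodes=1, ngpus=8, batch_size=128) == 1024

# To match the multi-node run on one 8-GPU machine you need 512 per GPU
# (or 128 per GPU with 4 accumulation steps, if the codebase supports it).
assert effective_batch_size(nodes=1, ngpus=8, batch_size=512) == 4096
```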
Issue Analytics
- Created 2 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
We inherited this gradient accumulation feature from the BEiT codebase. In the coming week we will be busy with some other paper-related work, so I'm not sure I can contribute this in the short term. If it is still relevant or needed after one week, I'm happy to contribute. For me the main thing to work out would be the process; the code part should be simple enough.
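For readers unfamiliar with the feature being discussed, the following is a generic PyTorch-style sketch of gradient accumulation. It is not the BEiT/ConvNeXt implementation; the update_freq name and the loop structure are illustrative assumptions.

```python
def train_one_epoch(model, loader, optimizer, criterion, update_freq=4):
    """Accumulate gradients over `update_freq` micro-batches so the effective
    batch size becomes update_freq * (per-GPU batch size)."""
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        loss = criterion(model(images), targets)
        # Scale the loss so the accumulated gradient matches one large batch.
        (loss / update_freq).backward()
        if (step + 1) % update_freq == 0:
            optimizer.step()
            optimizer.zero_grad()
```

With this pattern, --batch_size 128 and four accumulation steps per GPU would give the same effective batch size as --batch_size 512 without the memory cost of the larger per-GPU batch.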
BTW, the conversion for single-machine training you mentioned (using --batch_size 512) seems very likely to OOM on a typical machine for a rather large model…

Hi @anonymoussss,
It can affect reproduction results, as each batch size has a different optimal learning rate. It is common practice to scale the learning rate in proportion to the batch size, meaning you may use 3e-3 (instead of 4e-3 for 4096) as the learning rate if your effective batch size is 3072.
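As a concrete illustration of the linear scaling rule mentioned above (a sketch, not code from the repository):

```python
def scale_lr(base_lr, base_batch_size, effective_batch_size):
    """Linear scaling rule: keep lr / batch_size constant."""
    return base_lr * effective_batch_size / base_batch_size

# 4e-3 at a total batch size of 4096 scales to 3e-3 at 3072, matching the comment above.
print(scale_lr(4e-3, 4096, 3072))  # -> 0.003
```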