The batch-sizes of single machine commands are not adjusted
See original GitHub issue
On the training doc, I believe we need to adjust the batch size (or the LR) in the single-machine commands so that the total batch size stays the same.
For example, the ConvNeXt-S section currently reports:
- Multi-node:
--nodes 4 --ngpus 8 --batch_size 128 --lr 4e-3
- Single-machine:
--nproc_per_node=8 --batch_size 128 --lr 4e-3
<- I believe here it should be --batch_size 512
The same applies to the other variants (a quick arithmetic check is sketched below).
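For reference, here is a minimal arithmetic sketch of the total batch size implied by the commands above. This is illustrative only, not code from the ConvNeXt repo; the update_freq factor is an assumption based on the gradient-accumulation feature discussed in the comments below.

```python
def effective_batch_size(nodes, ngpus, batch_size, update_freq=1):
    # Total batch size = processes (nodes * GPUs per node) * per-GPU batch size,
    # multiplied by any gradient-accumulation steps.
    return nodes * ngpus * batch_size * update_freq

# Multi-node command from the doc: 4 nodes x 8 GPUs x 128 per GPU = 4096
assert effective_batch_size(nodes=4, ngpus=8, batch_size=128) == 4096

# Single-machine command as written: 1 node x 8 GPUs x 128 per GPU = 1024
assert effective_batch_size(nodes=1, ngpus=8, batch_size=128) == 1024

# To match the multi-node run on one 8-GPU machine you need 512 per GPU
# (or 128 per GPU with 4 accumulation steps, if the codebase supports it).
assert effective_batch_size(nodes=1, ngpus=8, batch_size=512) == 4096
```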
Issue Analytics
- Created 2 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
We inherited this gradient accumulation feature from the BEiT codebase. In the coming week we will be busy with some other paper-related work, so I'm not sure I can contribute this in the short term. If it is still relevant or needed after one week, I'm happy to contribute. For me the main thing to work out would be the process; the code part should be simple enough.
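For readers unfamiliar with the feature being discussed, the following is a generic PyTorch-style sketch of gradient accumulation. It is not the BEiT/ConvNeXt implementation; the update_freq name and the loop structure are illustrative assumptions.

```python
def train_one_epoch(model, loader, optimizer, criterion, update_freq=4):
    """Accumulate gradients over `update_freq` micro-batches so the effective
    batch size becomes update_freq * (per-GPU batch size)."""
    model.train()
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(loader):
        loss = criterion(model(images), targets)
        # Scale the loss so the accumulated gradient matches one large batch.
        (loss / update_freq).backward()
        if (step + 1) % update_freq == 0:
            optimizer.step()
            optimizer.zero_grad()
```

With this pattern, --batch_size 128 and four accumulation steps per GPU would give the same effective batch size as --batch_size 512 without the memory cost of the larger per-GPU batch.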
BTW, the conversion for single-machine training you mentioned (using --batch_size 512) seems very likely to OOM on a typical machine for a rather large model…

Hi @anonymoussss,
It can affect reproduction results, as each batch size has a different optimal learning rate. It is common practice to scale the learning rate in proportion to the batch size, meaning you may use 3e-3 (instead of 4e-3 for 4096) as the learning rate if your effective batch size is 3072.
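As a concrete illustration of the linear scaling rule mentioned above (a sketch, not code from the repository):

```python
def scale_lr(base_lr, base_batch_size, effective_batch_size):
    """Linear scaling rule: keep lr / batch_size constant."""
    return base_lr * effective_batch_size / base_batch_size

# 4e-3 at a total batch size of 4096 scales to 3e-3 at 3072, matching the comment above.
print(scale_lr(4e-3, 4096, 3072))  # -> 0.003
```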