Update no_trainer scripts to include gradient accumulation
Feature request
🤗 Accelerate has a gradient accumulation wrapper, and the no_trainer
scripts should be updated to include it!
An example can be seen here; below is an example diff of what the integration would look like:
```diff
- accelerator = (
-     Accelerator(log_with=args.report_to, logging_dir=args.output_dir) if args.with_tracking else Accelerator()
- )
+ accelerator = (
+     Accelerator(log_with=args.report_to, logging_dir=args.output_dir, gradient_accumulation_steps=args.gradient_accumulation_steps) if args.with_tracking else Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
+ )
```
As well as:
```diff
- num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
+ num_update_steps_per_epoch = len(train_dataloader)
  ...
  for step, batch in enumerate(train_dataloader):
+     with accelerator.accumulate(model):
-         loss = loss / args.gradient_accumulation_steps
          accelerator.backward(loss)
-         if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)
          completed_steps += 1
```
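For reference, here is a minimal sketch of what the training loop looks like after applying the diff above. Variable names such as `train_dataloader`, `progress_bar`, and `completed_steps` follow the existing no_trainer scripts; this is illustrative rather than a drop-in patch:

```python
# Sketch only: Accelerate's `accumulate` context manager handles scaling the
# accumulated loss and skips the optimizer step on non-sync batches, so the
# manual division of the loss and the modulo check on `step` are no longer
# needed.
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    progress_bar.update(1)
    completed_steps += 1
```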
The list of available scripts to update includes:
- examples/pytorch/image-classification/run_image_classification_no_trainer.py
- examples/pytorch/language-modeling/run_clm_no_trainer.py
- examples/pytorch/language-modeling/run_mlm_no_trainer.py
- examples/pytorch/multiple-choice/run_swag_no_trainer.py
- examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py
- examples/pytorch/question-answering/run_qa_no_trainer.py
- examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py
- examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py
- examples/pytorch/summarization/run_summarization_no_trainer.py
Motivation
This is a great first issue for someone who wants to learn how to use some of the latest bits in Accelerate and get an easy beginner contribution to the library 🤗
Your contribution
If you decide to pick up this issue, feel free to ping myself (@muellerzr), @sgugger, or @pacman100 to review 🤗
Top GitHub Comments
No, we can't do this, as the user would then have to know in advance the number of optimization steps when they create their scheduler (which they don't, since Accelerate handles gradient accumulation behind the scenes). That's why the learning rate scheduler should be created with the full number of training batches prior to gradient accumulation, then stepped at each batch (which is roughly equivalent to creating it with the right number of optimization steps and stepping at every optimization step).
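For illustration, a hedged sketch of the scheduler setup described above, using the `get_scheduler` helper and argument names (`args.lr_scheduler_type`, `args.num_warmup_steps`, `args.num_train_epochs`) that the existing no_trainer scripts already use; the exact names may differ per script:

```python
from transformers import get_scheduler

# Per the comment above: size the scheduler over the full number of training
# batches (not divided by gradient_accumulation_steps), then step it once per
# batch inside the training loop.
num_training_steps = args.num_train_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=num_training_steps,
)
```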
I think either option would work fine as well. The reason `sync_gradients` exists as part of the Accelerator is to provide an open interface for exactly this kind of check, so from an API design standpoint it's correct. My $0.02 is to either explain briefly in a comment what `sync_gradients` checks, or to do as Sylvain recommended here.
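For context, a minimal sketch of the `sync_gradients` check being discussed, assuming the loop from the diff above; `accelerator.sync_gradients` is `True` only on batches where the accumulated gradients are actually applied:

```python
for step, batch in enumerate(train_dataloader):
    with accelerator.accumulate(model):
        outputs = model(**batch)
        accelerator.backward(outputs.loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    # Only count an update when gradients were synchronized, so the progress
    # bar and `completed_steps` track optimizer updates rather than raw batches.
    if accelerator.sync_gradients:
        progress_bar.update(1)
        completed_steps += 1
```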