
[docs] critical API documentation is missing

See original GitHub issue

It looks like things kind of work, but not quite right, and there are a lot of subtle nuances that are hard to know about when integrating DeepSpeed. I think all of these should be made loud and clear - and perhaps a simple full example of a training loop would help, including commented-out code showing where the original training code is removed to do it the DeepSpeed way.

As I am trying to figure out how to make gradient_accumulation_steps work correctly, I’m finding all kinds of things I have missed when integrating DeepSpeed into the HF Trainer. I will post them here as I find them (a consolidated sketch of the resulting loop follows the list):

  1. The engine’s backward returns the loss, which it modifies when gradient_accumulation_steps > 1, but this is undocumented.

  2. It’s also not documented that the “client” shouldn’t scale the loss by gradient_accumulation_steps, since DeepSpeed does it in backward.

  3. The fact that lr_scheduler.step happens inside the engine’s step is not documented in the API.

  4. The “client” must not skip engine.step() when gradient_accumulation_steps > 1, and since this is an integration of many methods, it leads to complicated, brittle code:

                if self.deepspeed:
                    self.deepspeed.step()

                if (step + 1) % self.args.gradient_accumulation_steps == 0 or (
                    # last step in epoch but step is always smaller than gradient_accumulation_steps
                    steps_in_epoch <= self.args.gradient_accumulation_steps
                    and (step + 1) == steps_in_epoch
                ):
                    # Gradient clipping
                    if self.args.max_grad_norm is not None and self.args.max_grad_norm > 0 and not self.deepspeed:
                        # deepspeed does its own clipping
                        if self.use_amp:
                            # AMP: gradients need unscaling
                            self.scaler.unscale_(self.optimizer)
                        [...]
                        else:
                            # Revert to normal clipping otherwise, handling Apex or full precision
                            torch.nn.utils.clip_grad_norm_(
                                amp.master_params(self.optimizer) if self.use_apex else model.parameters(),
                                self.args.max_grad_norm,
                            )

                    # Optimizer step
                    if self.deepspeed:
                        pass # called outside the loop
                    [...]
                    else:
                        self.optimizer.step()

                    if not self.deepspeed:
                        self.lr_scheduler.step()

                    model.zero_grad()
                    [...]
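
Putting points 1-4 together, here is a minimal sketch of what the inner loop looks like once those fixes are applied. This is not the actual Trainer code: the toy model, the ds_config values, and the synthetic dataloader are assumptions purely for illustration, and the config= kwarg assumes a reasonably recent DeepSpeed release.

    import torch
    import deepspeed

    # Toy stand-ins so the sketch is self-contained (not from the original issue).
    model = torch.nn.Linear(8, 2)
    dataloader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(8)]
    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 2,
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
    }

    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    for step, (inputs, labels) in enumerate(dataloader):
        outputs = engine(inputs.to(engine.device))
        loss = torch.nn.functional.cross_entropy(outputs, labels.to(engine.device))

        # Point 2: do NOT divide loss by gradient_accumulation_steps here;
        # point 1: engine.backward() scales it internally and returns that scaled loss.
        loss = engine.backward(loss)

        # Point 4: call step() on every micro-batch; the engine only performs the
        # real optimizer update on accumulation boundaries.
        # Point 3: engine.step() also advances the lr_scheduler (when one is
        # configured), so there is no separate lr_scheduler.step() call.
        engine.step()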

After fixing the above 4 I managed to get the same weights and loss with bs=8/grad_accum=1 and bs=4/grad_accum=2. Yay!

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
janEbert commented, Apr 30, 2021

Yea I guess it’s better to just give your concrete advice; who knows how the internals will change with the speed this is moving at. 😃

1 reaction
janEbert commented, Apr 30, 2021

Adding to this:

  5. Remove model = model.cuda() and model = model.half() calls.

  6. Do keep data = data.cuda() and optionally add data = data.half() calls (see below comment for a dynamic option).

  7. Substitute torch.nn.utils.clip_grad_norm_(params, X) with "gradient_clipping": X in the DeepSpeed config.

  8. Guard parallel I/O using parallel primitives (except DeepSpeed checkpointing, which is already nicely documented in the code and at deepspeed.ai).

  9. Use torch.distributed methods for parallel primitives (e.g. torch.distributed.get_world_size(), torch.distributed.get_rank(), int(os.environ['LOCAL_RANK']), …).
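
A quick sketch of what points 5-9 can look like in practice; the config values, the run_metadata.pt write, and the batch-moving example are illustrative assumptions, not code from the comment.

    import os
    import torch
    import torch.distributed as dist

    # Point 7: gradient clipping moves into the DeepSpeed config (1.0 is just an example).
    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 2,
        "gradient_clipping": 1.0,
        "fp16": {"enabled": True},
    }

    # Point 5: the model goes into deepspeed.initialize() as a plain nn.Module,
    # with no model.cuda() / model.half() calls beforehand.
    # Point 6: inputs still have to be moved (and optionally cast) by hand, e.g.:
    #     batch = {k: v.to(local_rank) for k, v in batch.items()}

    # Point 9: standard torch.distributed / launcher primitives.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1

    # Point 8: guard non-DeepSpeed I/O so only one process writes.
    if rank == 0:
        torch.save({"world_size": world_size}, "run_metadata.pt")
    if dist.is_initialized():
        dist.barrier()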

And a bit less interesting (though one of these should probably be adjusted instead of documented):

  10. In deepspeed.initialize, the optimizer kwarg has precedence over the "optimizer" key in the DeepSpeed config.

  11. In deepspeed.initialize, the "scheduler" key in the DeepSpeed config has precedence over the lr_scheduler kwarg.
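
A sketch of the setup points 10 and 11 describe. Which side actually wins (or whether newer DeepSpeed releases reject the duplicate specification outright) is version-dependent, so the comments below only restate the behaviour reported here, not a guaranteed contract; the model and config values are made up for the sketch.

    import torch
    import deepspeed

    model = torch.nn.Linear(8, 2)   # stand-in model just for the sketch

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        # Point 10: reportedly ignored when an optimizer object is also passed in.
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        # Point 11: reportedly wins over the lr_scheduler kwarg below.
        "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 100}},
    }

    client_optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    client_scheduler = torch.optim.lr_scheduler.StepLR(client_optimizer, step_size=10)

    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        optimizer=client_optimizer,     # point 10: this one takes precedence
        lr_scheduler=client_scheduler,  # point 11: overridden by the config "scheduler"
        config=ds_config,
    )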
