[docs] critical API documentation is missing
It looks like things kind of work, except not quite right, and there are a lot of subtle nuances that are very hard to know about when integrating DeepSpeed. I think all of these should be made loud and clear - and perhaps a simple full example of a training loop would help, including commented-out code showing where the original training code is removed to do it the DeepSpeed way.
As I am trying to figure out how to make `gradient_accumulation_steps` work correctly, I'm finding all kinds of things I have missed when integrating DeepSpeed into the HF Trainer. I will post them here as I find them:
1. The engine's `backward` returns `loss`, which it modifies when `gradient_accumulation_steps > 1`, but this is undocumented:
   - neither in the API docstring: https://github.com/microsoft/DeepSpeed/blob/e60e92eb0a06673748c4cb63fbcf713ddd12fc22/deepspeed/runtime/engine.py#L852-L858
   - nor in the main docs: https://www.deepspeed.ai/getting-started/#training
2. It's also not documented that the "client" shouldn't scale the loss by `gradient_accumulation_steps`, since DeepSpeed does it in `backward`.
3. The fact that `lr_scheduler.step` happens inside the engine's `step` is not documented in the API:
   - https://github.com/microsoft/DeepSpeed/blob/e60e92eb0a06673748c4cb63fbcf713ddd12fc22/deepspeed/runtime/engine.py#L993-L996
   - but it is documented at https://www.deepspeed.ai/getting-started/#training
   - it might be a good idea to also add an explicit note: make sure to remove `lr_scheduler.step()` from your code if you are using DeepSpeed's scheduler.
4. The "client" must not skip `engine.step()` when `gradient_accumulation_steps > 1`, and since this is an integration of many methods it leads to complicated, brittle code:
```python
if self.deepspeed:
    self.deepspeed.step()

if (step + 1) % self.args.gradient_accumulation_steps == 0 or (
    # last step in epoch but step is always smaller than gradient_accumulation_steps
    steps_in_epoch <= self.args.gradient_accumulation_steps
    and (step + 1) == steps_in_epoch
):
    # Gradient clipping
    if self.args.max_grad_norm is not None and self.args.max_grad_norm > 0 and not self.deepspeed:
        # deepspeed does its own clipping
        if self.use_amp:
            # AMP: gradients need unscaling
            self.scaler.unscale_(self.optimizer)
            [...]
        else:
            # Revert to normal clipping otherwise, handling Apex or full precision
            torch.nn.utils.clip_grad_norm_(
                amp.master_params(self.optimizer) if self.use_apex else model.parameters(),
                self.args.max_grad_norm,
            )

    # Optimizer step
    if self.deepspeed:
        pass  # called outside the loop
    [...]
    else:
        self.optimizer.step()

    if not self.deepspeed:
        self.lr_scheduler.step()

    model.zero_grad()
    [...]
```
After fixing the above 4 issues I managed to get the same weights and loss with bs=8/grad_accum=1 and bs=4/grad_accum=2. Yay!
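Pulling the four findings above together, here is a minimal, standalone sketch of what a loop looks like once DeepSpeed owns loss scaling, gradient accumulation, the optimizer step and the scheduler step. This is not the HF Trainer code: `model`, `dataloader` and `ds_config` are placeholders, the model is assumed to be HF-style and return an object with a `.loss` attribute, and depending on the DeepSpeed version the config may need to be passed via `args`/`config_params` rather than the `config` kwarg.

```python
import deepspeed

# gradient_accumulation_steps lives in ds_config; the client loop never touches it.
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step, batch in enumerate(dataloader):
    outputs = engine(**batch)   # forward through the engine, not the raw model
    loss = outputs.loss         # assumes an HF-style model that returns .loss

    # (1) backward() returns the loss it actually used; with
    #     gradient_accumulation_steps > 1 it is already divided by that factor.
    # (2) do NOT divide the loss by gradient_accumulation_steps yourself.
    loss = engine.backward(loss)

    # (3)+(4) call step() on every micro-step; the engine decides internally
    # whether this is a real optimizer step or just an accumulation boundary,
    # and it also runs lr_scheduler.step() for you - so no separate
    # optimizer.step(), lr_scheduler.step() or model.zero_grad() in the client.
    engine.step()
```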
Top GitHub Comments
Yeah, I guess it's better to just give your concrete advice; who knows how the internals will change with the speed this is moving at. 😃
Adding to this:

5. Remove `model = model.cuda()` and `model = model.half()` calls.
6. Do keep `data = data.cuda()` and optionally add `data = data.half()` calls (see the comment below for a dynamic option).
7. Substitute `torch.nn.utils.clip_grad_norm_(params, X)` with `"gradient_clipping": X` in the DeepSpeed config.
8. Guard parallel I/O using parallel primitives (except DeepSpeed checkpointing, which is already nicely documented in the code and at deepspeed.ai).
9. Use `torch.distributed` methods for parallel primitives (e.g. `torch.distributed.get_world_size()`, `torch.distributed.get_rank()`, `int(os.environ['LOCAL_RANK'])`, …).
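A rough sketch of points 5, 6, 8 and 9 applied to a plain PyTorch script - `dataloader`, `client_state` and the file path are placeholders, and the `LOCAL_RANK` environment variable is assumed to be set by the distributed/DeepSpeed launcher:

```python
import os
import torch
import torch.distributed as dist

# 5. no model = model.cuda() / model = model.half() here:
#    deepspeed.initialize() places and casts the model based on the config.

local_rank = int(os.environ["LOCAL_RANK"])   # 9. rank info from the launcher

for batch in dataloader:
    # 6. inputs still have to be moved (and optionally cast) by the client
    batch = {k: v.cuda(local_rank) for k, v in batch.items()}
    ...

# 8. guard I/O that isn't DeepSpeed checkpointing so only one rank writes,
#    then synchronize everyone before moving on (9. torch.distributed primitives)
if dist.get_rank() == 0:
    torch.save(client_state, "client_state.pt")   # placeholder object / path
dist.barrier()
```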
And a bit less interesting (though one of these should probably be adjusted instead of documented):

10. In `deepspeed.initialize`, the `optimizer` kwarg takes precedence over the `"optimizer"` key in the DeepSpeed config.
11. In `deepspeed.initialize`, the `"scheduler"` key in the DeepSpeed config takes precedence over the `lr_scheduler` kwarg.
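To make points 7, 10 and 11 concrete, here is roughly how the pieces interact at `deepspeed.initialize` time. The config values are arbitrary examples, `model`, `my_optimizer` and `my_scheduler` are placeholders, and the precedence noted in the comments is the behaviour reported above for the version this issue was filed against - newer releases may handle (or outright reject) the duplication differently:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    # 7. replaces torch.nn.utils.clip_grad_norm_(params, 1.0) in the client code
    "gradient_clipping": 1.0,
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 100}},
}

engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    optimizer=my_optimizer,     # 10. this kwarg wins over ds_config["optimizer"]
    lr_scheduler=my_scheduler,  # 11. but ds_config["scheduler"] wins over this kwarg
    config=ds_config,
)
```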