[trainer] new in pytorch: `torch.optim._multi_tensor` faster optimizers
Back in September PyTorch introduced `torch.optim._multi_tensor` (https://github.com/pytorch/pytorch/pull/43507), which should be much more efficient for situations with lots of small feature tensors (as in transformers) and thus should show an appreciable speedup in training. If someone is interested in the progress of this project, here is the stack to track: https://github.com/pytorch/pytorch/pull/48223
This feature is currently at an alpha stage, so users can try it by simply replacing `torch.optim` with `torch.optim._multi_tensor` in the HF Trainer or their own trainer. Eventually it will replace `torch.optim`, so otherwise there is nothing we need to do.
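As a concrete illustration, here is a minimal sketch of trying the multi-tensor AdamW with the Trainer by building the optimizer explicitly and passing it in; `model`, `training_args`, and `train_dataset` are assumed to already exist and are not from this issue:

```python
# Minimal sketch: build the experimental multi-tensor AdamW explicitly and hand
# it to the Trainer via the `optimizers` argument. `model`, `training_args`, and
# `train_dataset` are assumed to be defined elsewhere.
import torch
from transformers import Trainer

optimizer = torch.optim._multi_tensor.AdamW(
    model.parameters(),
    lr=training_args.learning_rate,
    weight_decay=training_args.weight_decay,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # Passing (optimizer, None) lets the Trainer create its usual LR scheduler
    # around the supplied optimizer.
    optimizers=(optimizer, None),
)
trainer.train()
```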
@blefaudeux, who alerted me to this improvement, suggested it should give good speedups for DDP/Sharded DDP training.
If resources allow, it'd be good to run some benchmarks. Please feel free to beat me to it.
Thanks to @blefaudeux for the heads up, and @izdeby for working on this enhancement and clarifying where things are at.
Heads up to @sgugger and @patrickvonplaten - nothing else needs to be done.
Yes, I was just about to revisit it.
edit: I thought you might have wanted to work on that, but the pytorch team asks us to run a profiler on it and all, so I will probably look into testing it out again.
— original comment —
Do you want to take the lead on this experiment, @jaketae?
The new `--optim` HF Trainer flag just got merged, so you can quickly implement `--optim adamw_torch_multi_tensor` in the same way as `--optim adamw`.
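Until such an option lands, a hypothetical sketch of the wiring could look like the following; the subclass name and the simplified weight-decay grouping are illustrative only, not the Trainer's actual implementation:

```python
# Hypothetical sketch: override `create_optimizer` to build the experimental
# multi-tensor AdamW with the usual Trainer hyperparameters. The subclass and
# the simplified weight-decay grouping below are illustrative, not merged code.
import torch
from transformers import Trainer


class MultiTensorTrainer(Trainer):
    def create_optimizer(self):
        if self.optimizer is None:
            # Simplified split: no weight decay for biases and LayerNorm weights.
            decay, no_decay = [], []
            for name, param in self.model.named_parameters():
                if not param.requires_grad:
                    continue
                if "bias" in name or "LayerNorm" in name:
                    no_decay.append(param)
                else:
                    decay.append(param)
            grouped_params = [
                {"params": decay, "weight_decay": self.args.weight_decay},
                {"params": no_decay, "weight_decay": 0.0},
            ]
            # Swap in the experimental multi-tensor implementation of AdamW.
            self.optimizer = torch.optim._multi_tensor.AdamW(
                grouped_params,
                lr=self.args.learning_rate,
                betas=(self.args.adam_beta1, self.args.adam_beta2),
                eps=self.args.adam_epsilon,
            )
        return self.optimizer
```

Using `MultiTensorTrainer` in place of `Trainer` would then exercise the multi-tensor path without touching the `--optim` plumbing.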
You can use this tool for benchmarking if it helps: https://github.com/huggingface/transformers/pull/14934. I think it's pretty stable now, and I will propose it as a PR.
You must have a really strange bottleneck in that test, if neither the latest fairscale nor these optimizers change anything? These optimizers are measurably faster in isolation, and sure enough we see a difference in the fairscale CI, even on a dummy job / small model (see, for instance, the last two jobs).
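For context, here is a rough isolation microbenchmark sketch along the lines of what is being described; the tensor counts, sizes, and step counts below are arbitrary choices and not taken from the fairscale CI:

```python
# Rough, illustrative microbenchmark (arbitrary sizes, wall-clock timing): step
# the stock AdamW and the multi-tensor AdamW over many small parameter tensors,
# which is roughly the shape of a transformer's parameter list.
import time
import torch


def bench(optim_cls, n_tensors=500, numel=1024, steps=100, device=None):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    params = [torch.randn(numel, device=device, requires_grad=True) for _ in range(n_tensors)]
    opt = optim_cls(params, lr=1e-3)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        for p in params:
            p.grad = torch.randn_like(p)  # fake gradients so opt.step() has work to do
        opt.step()
        opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start


for name, cls in [
    ("torch.optim.AdamW", torch.optim.AdamW),
    ("torch.optim._multi_tensor.AdamW", torch.optim._multi_tensor.AdamW),
]:
    print(f"{name}: {bench(cls):.3f}s")
```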