
[trainer] large scale models support

See original GitHub issue

As I am integrating DeepSpeed ZeRO-3 (https://github.com/huggingface/transformers/pull/10753), which can run on hundreds of GPUs and train models with trillions of parameters, I see an emerging need to adjust how the Trainer is used.

Currently the usage is:

    from transformers import T5ForConditionalGeneration, Trainer

    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    trainer = Trainer(model=model, ....)
    trainer.train()

The problem is that this implies the model can fit in the first node’s general RAM, which is not always the case. For example, in my PR I propose the following change:

    from transformers.integrations import deepspeed_is_zero3_enabled
    deepspeed_is_zero3_enabled(True)
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

and I change from_pretrained to not init the model right away on CPU, and to load the pretrained weights directly onto all participating GPUs, which allows loading models that are bigger than a single GPU. Since the PR hasn’t been reviewed yet (I’m still working on it), the API may change, but what I’m trying to communicate here is that we need the DeepSpeed configuration before we create the model. This change is only needed for ZeRO-3, and at the moment I have no knowledge of that until the Trainer is created (but I’m changing this).

We can automagically discover that we are running under ZeRO-3 if the user is using command-line args and passes --deepspeed ds_config.json, but I can’t do this if the user isn’t using the command line to launch the script.
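
To make that concrete, here is a minimal sketch of how ZeRO-3 could be detected from a DeepSpeed config file before the model is created. The helper name is made up for illustration, and deepspeed_is_zero3_enabled is the API proposed above, which may still change:

    import json

    from transformers.integrations import deepspeed_is_zero3_enabled

    def config_requests_zero3(ds_config_path):
        # DeepSpeed configs record the ZeRO stage under zero_optimization.stage
        with open(ds_config_path) as f:
            return json.load(f).get("zero_optimization", {}).get("stage") == 3

    if config_requests_zero3("ds_config.json"):
        # mirror what the --deepspeed ds_config.json command-line path would do
        deepspeed_is_zero3_enabled(True)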

In addition, in the Trainer we already have a ton of logic where we purposefully don’t call model.to(device), so that’s another indication that model placement needs special treatment.
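
Roughly, that placement decision boils down to something like the following (an illustration of the pattern, not the actual Trainer code):

    def maybe_place_model(model, device, using_deepspeed, model_parallel):
        # Illustration only: when a backend such as DeepSpeed (or a
        # model-parallel setup) owns parameter placement, a blanket
        # model.to(device) would try to materialize the whole model
        # on a single device, so the Trainer must skip it.
        if using_deepspeed or model_parallel:
            return model
        return model.to(device)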

So the paradigm shift that may have to happen is: we init the Trainer first and gather all the info we need about how the model will be used, then we init the model and pass it to the existing Trainer object, and then we train. Something like:

    trainer = Trainer(...)
    new_model_init_specific_args = trainer.model_init_specific_args()
    model = T5ForConditionalGeneration.from_pretrained("t5-small", **new_model_init_specific_args)
    trainer.model(model)
    trainer.train()

Please let me know if the need makes sense.

I think I can manage the current PR with some hacks to avoid this, but eventually I think we will need to switch to something like what I proposed here as we move toward supporting very large models.

Nothing that needs to be done right away, just sharing the emerging need.

Here is a bit of a preview of how I had to change from_pretrained():

https://github.com/huggingface/transformers/blob/538a4026a1c6c477c1932b435dcce7cbacfc5898/src/transformers/modeling_utils.py#L1062-L1068

https://github.com/huggingface/transformers/blob/538a4026a1c6c477c1932b435dcce7cbacfc5898/src/transformers/modeling_utils.py#L1124-L1135

This allows loading the exact partition of the params for each GPU without ever loading the whole model on CPU or on a single GPU (well, state_dict loading is a problem at the moment, as it still gets fully copied to CPU, but we will have to sort this out down the road).
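
The underlying DeepSpeed primitive is the deepspeed.zero.Init context manager; very roughly, the idea looks like this (a sketch of the mechanism, not the PR code verbatim):

    import deepspeed
    from transformers import T5Config, T5ForConditionalGeneration

    config = T5Config.from_pretrained("t5-small")

    # Constructing the model inside deepspeed.zero.Init() partitions each
    # parameter across the participating GPUs as it is created, so the full
    # model never has to fit on one device. The pretrained state_dict is then
    # loaded into the already-partitioned parameters.
    with deepspeed.zero.Init():
        model = T5ForConditionalGeneration(config)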

In the following addition, we invade generation_utils because now we have to make all GPUs work in sync, and no GPU can stop running forward until all GPUs have finished generating their sequences.

https://github.com/huggingface/transformers/blob/538a4026a1c6c477c1932b435dcce7cbacfc5898/src/transformers/generation_utils.py#L1273-L1287

So that’s another new concept, but this one is less of an issue with how the Trainer is run - I just wanted to give a complete picture of the major needs. (And this particular code will change a bit thanks to @patrickvonplaten’s comments - I just haven’t gotten to it yet.)
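
Conceptually, the synchronization boils down to a collective check like the one below (a sketch of the idea, not the generation_utils code itself): under ZeRO-3 every forward pass gathers parameters collectively, so a rank that has finished its own sequence must keep calling forward until every rank is done.

    import torch
    import torch.distributed as dist

    def all_ranks_finished(this_rank_finished: bool) -> bool:
        # 1.0 means "this rank still needs to run forward"; summing across
        # ranks tells every GPU whether anyone is still generating.
        flag = torch.tensor(0.0 if this_rank_finished else 1.0, device="cuda")
        dist.all_reduce(flag, op=dist.ReduceOp.SUM)
        return flag.item() == 0.0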

Please also feel free to comment in the PR directly as that part of the code is pretty complete. I just made this issue separate to discuss the bigger need.

Thank you!

@sgugger, @LysandreJik, @patrickvonplaten

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Mar 25, 2021

Thanks for the very detailed summary @stas00! All of the changes you propose make sense. The changes to from_pretrained look inevitable, and the approach you propose looks like it does the job without being invasive in other parts of the library that we want to keep readable like the model files.

I know the API isn’t final and is prone to change, but could we imagine a flag like deepspeed_aware_instantiation or deepspeed_partitioning in the from_pretrained method, rather than a deepspeed_is_zero3_enabled(True) call? I think this would be more in line with how we manage things in the library from the user’s perspective (which is principally through kwargs). I know none of this is final, but thinking of the API before everything is implemented doesn’t sound like a bad idea 😃
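
Purely as an illustration of the kwarg-style API being suggested (neither flag exists in the library), the call would look something like:

    from transformers import T5ForConditionalGeneration

    # Hypothetical flag name from the suggestion above - not a real argument.
    model = T5ForConditionalGeneration.from_pretrained(
        "t5-small", deepspeed_aware_instantiation=True
    )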

1 reaction
sgugger commented, Mar 25, 2021

I’m not sure you are aware, but the Trainer can take a model_init parameter that… well… creates the model 😉 Have you explored how it could help with this particular problem?

The changes in the other parts of the lib look reasonable to me at first glance.
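
For context, a minimal sketch of that model_init path (model_init and the deepspeed training argument are part of the public Trainer API; the dataset wiring is omitted):

    from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

    def model_init():
        # Called by the Trainer itself, so by the time this runs the Trainer
        # already knows how the model will be used (e.g. its DeepSpeed config).
        return T5ForConditionalGeneration.from_pretrained("t5-small")

    args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")
    trainer = Trainer(model_init=model_init, args=args)  # datasets omitted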

