[trainer] large scale models support
As I am integrating DeepSpeed ZeRO-3, which can run on hundreds of GPUs and train models with trillions of params (https://github.com/huggingface/transformers/pull/10753), I see an emerging need to adjust how the Trainer is used.
Currently the usage is:

```python
model = T5ForConditionalGeneration.from_pretrained("t5-small")
trainer = Trainer(model=model, ...)
trainer.train()
```
The problem is that this implies the model can fit in the first node's general RAM, which is not always the case. For example, in my PR I propose the following change:
```python
from transformers.integrations import deepspeed_is_zero3_enabled

deepspeed_is_zero3_enabled(True)

model = T5ForConditionalGeneration.from_pretrained("t5-small")
```
and I change `from_pretrained` so that it does not init the model right away on CPU and instead loads the pre-trained weights directly onto all participating GPUs, which allows loading models that are bigger than a single GPU. Since the PR hasn't been reviewed yet (I'm still working on it), the API may change, but what I'm trying to communicate here is that we need the DeepSpeed configuration before we create the model. This change is only needed for ZeRO-3, and at the moment I have no knowledge of that until the Trainer is created (but I'm changing this).
We can automagically discover whether we are running under ZeRO-3 if the user is using command-line args and passes `--deepspeed ds_config.json`, but I can't do this if the user isn't launching the script from the command line. A rough sketch of that kind of auto-detection is below.
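Just to illustrate what that auto-detection amounts to, here is a minimal sketch; the helper name `config_enables_zero3` is made up for the example, and in the real integration the check happens once the Trainer gets the config passed via `--deepspeed`:

```python
import json

def config_enables_zero3(ds_config_path: str) -> bool:
    """Return True if the given DeepSpeed config requests ZeRO stage 3.

    Illustrative helper only: the real integration inspects the config
    the user passed via `--deepspeed ds_config.json`.
    """
    with open(ds_config_path) as f:
        ds_config = json.load(f)
    return ds_config.get("zero_optimization", {}).get("stage", 0) == 3
```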
In addition, the Trainer already has a ton of logic where we purposefully don't call `model.to(device)`, so that's another indication that model placement needs special treatment.
So the paradigm shift that may have to happen is that we init the Trainer first and gather all the info we need about how the model will be used. Then we init the model, pass it to the existing Trainer object, and train. Something like:
```python
trainer = Trainer(...)
new_model_init_specific_args = trainer.model_init_specific_args()
model = T5ForConditionalGeneration.from_pretrained("t5-small", **new_model_init_specific_args)
trainer.model(model)
trainer.train()
```
Please let me know if the need makes sense.
I think I can manage the current PR with some hacks to avoid this, but eventually I think we will need to switch to something like what I proposed here, so we can move into a future where we support very large models.
Nothing that needs to be done right away, just sharing the emerging need.
Here is a bit of a preview of how I had to change `from_pretrained`: it allows loading the exact partition of the params on each GPU without ever loading the full model on CPU or on a single GPU (well, `state_dict` loading is a problem at the moment as it still gets fully copied on CPU, but we will have to sort this out down the road).
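The actual diff is in the PR and isn't reproduced here. As a rough illustration of the underlying idea only: DeepSpeed exposes a `deepspeed.zero.Init` context manager that partitions each parameter across the participating GPUs as its module is constructed, so the full model never materializes on one device (in the real integration the DeepSpeed config is also passed to `zero.Init`; this sketch omits it):

```python
import deepspeed
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_pretrained("t5-small")

# Inside this context every parameter is partitioned across the
# participating gpus as soon as it is created, so no single device
# ever holds the full set of weights.
with deepspeed.zero.Init():
    model = T5ForConditionalGeneration(config)

# Loading pretrained weights into such a partitioned model is what the
# modified from_pretrained has to handle (and the state_dict currently
# still gets fully copied on cpu, as noted above).
```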
In the following addition we invade `generation_utils`, because now we have to make all GPUs work in sync and can't stop running `forward` until all GPUs have finished generating their sequence. So that's another new concept, but this one is less of an issue with how the Trainer is run; I just wanted to give a complete picture of the major needs. (And this particular code will change a bit thanks to @patrickvonplaten's commentary, I just didn't get to do it yet.)
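To make the sync requirement concrete, here is a sketch of the concept only, not the actual `generation_utils` code, assuming an initialized `torch.distributed` process group: under ZeRO-3 every `forward` call involves collectives across all ranks, so a GPU that finishes its sequence early must keep running forward passes until every rank reports it is done.

```python
import torch
import torch.distributed as dist

def everyone_is_done(this_rank_done: bool) -> bool:
    """Return True only once every rank has finished generating.

    Sketch only: a rank that finished early still has to participate in
    forward passes, because under ZeRO-3 forward() involves collective
    ops that need all ranks.
    """
    still_running = torch.tensor(0.0 if this_rank_done else 1.0, device="cuda")
    dist.all_reduce(still_running, op=dist.ReduceOp.SUM)
    return still_running.item() == 0.0
```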
Please also feel free to comment in the PR directly as that part of the code is pretty complete. I just made this issue separate to discuss the bigger need.
Thank you!
Top GitHub Comments
Thanks for the very detailed summary @stas00! All of the changes you propose make sense. The changes to `from_pretrained` look inevitable, and the approach you propose looks like it does the job without being invasive in other parts of the library that we want to keep readable, like the model files.

I know the API isn't final and prone to changes, but could we imagine a flag like `deepspeed_aware_instantiation` or `deepspeed_partitioning` in the `from_pretrained` method, rather than a `deepspeed_is_zero3_enabled(True)`? I think this would be more in line with how we manage things in the library from the user's perspective (which is principally through kwargs). I know none of this is final, but thinking of the API beforehand doesn't sound like a bad idea before everything is implemented 😃

I'm not sure you are aware, but the `Trainer` can take a `model_init` parameter that… well… creates the model 😉 Have you explored how it could help with this particular problem?

The changes in the other parts of the lib look reasonable to me at first glance.
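For reference, a minimal usage sketch of the `model_init` hook mentioned above (the `ds_config.json` path is a placeholder, and dataset arguments are omitted): because the Trainer itself calls `model_init` to build the model, it already knows its DeepSpeed configuration at that point, which is roughly the ordering problem described in the issue.

```python
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

def model_init():
    # Called by the Trainer to construct the model, so any
    # DeepSpeed-aware instantiation logic could be applied here
    # before the pretrained weights are loaded.
    return T5ForConditionalGeneration.from_pretrained("t5-small")

training_args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")

# Dataset and other arguments omitted for brevity.
trainer = Trainer(model_init=model_init, args=training_args)
```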