Deepspeed and T5-11B for multitask training
Carrying on my conversation with @stas00 from https://github.com/huggingface/transformers/issues/9996#issuecomment-968348129.
I used run_translation.py and now my loss is 0.0 😦. This is probably doomed to fail.
{'loss': 7.2639, 'learning_rate': 0.001, 'epoch': 0.02}
3%|████ | 612/24128 [42:13<26:09:12, 4.00s/it]{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.06}
8%|█████████████ | 1999/24128 [2:15:09<24:43:54, 4.02s/it][2021-11-25 22:01:13,181] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-25 22:01:13,181] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=7.902960485741644
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.1}
Script
export BS=8
PYTHONPATH=../../../src USE_TF=0 \
deepspeed --num_gpus=4 ./run_translation.py \
--model_name_or_path t5-11b \
--output_dir /local/nlp/temp/poetryT5-11B_new \
--evaluation_strategy=epoch \
--do_train \
--train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
--save_strategy=epoch \
--label_smoothing 0.1 \
--learning_rate 1e-3 \
--adafactor \
--overwrite_output_dir \
--max_source_length 64 \
--max_target_length 64 \
--num_train_epochs 1 \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--source_lang en \
--target_lang en \
--deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json \
--fp16
Data format
{"translation": {"en1": "Write a poetic sentence about 'people'", "en2": "In this age what people used to call."}}
{"translation": {"en1": "Write a poetic sentence about 'tale'", "en2": "Where evening is empty, an unfinished tale."}}
{"translation": {"en1": "Write a poetic sentence that ends in a word which rhymes with 'planes'", "en2": "Now the blood freezes in the veins."}}
{"translation": {"en1": "Write a poetic sentence about 'Weighs his spread' and ending in 'behold'", "en2": "Weighs his spread wings, at leasure to behold."}}
{"translation": {"en1": "Write a poetic sentence about 'lips'", "en2": "Her dry lips were tightly closed up."}}
def preprocess_function(examples):
    inputs = [ex["en1"] for ex in examples["translation"]]
    targets = [ex["en2"] for ex in examples["translation"]]
    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length" and data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
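For completeness, a hedged sketch of how the function above gets applied, building on the loading sketch earlier; prefix, padding, max_target_length and data_args are stand-ins for what run_translation.py sets up from its command-line arguments, and the values below mirror the command above but are assumptions.
from types import SimpleNamespace
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # t5-11b in the actual run
prefix = ""                                             # no --source_prefix was passed
padding = "max_length"                                  # or False without --pad_to_max_length
max_target_length = 64                                  # matches --max_target_length
data_args = SimpleNamespace(max_source_length=64, ignore_pad_token_for_loss=True)

train_dataset = raw_datasets["train"].map(
    preprocess_function,
    batched=True,
    remove_columns=["translation"],
)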
The first step is to make things work w/o overflow, the second step is dealing with memory.
As bf16 is all new it will take some time to fully sort it out. You can try solution (1) as well - it might just work.
So your fp32 OOM was w/ or w/o deepspeed?
fp32 takes about the same amount of memory as fp16 mixed precision, because the latter still allocates 4 bytes for master weights per param. So the latter saves some memory in some places, but uses more memory in others. fp16 amp is really about up to 5x speed up, not saving memory.
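To make that concrete, here is a rough back-of-envelope comparison; an illustration only, since it ignores optimizer states, activations and framework overhead, and Adafactor's states are smaller than Adam's anyway.
params = 11e9  # ~11B parameters for t5-11b

# plain fp32: fp32 weights + fp32 gradients
fp32_bytes = params * (4 + 4)

# fp16 mixed precision: fp16 weights + fp32 master weights + fp16 gradients
amp_bytes = params * (2 + 4 + 2)

print(f"fp32 weights+grads:     {fp32_bytes / 2**30:.0f} GiB")
print(f"fp16 AMP weights+grads: {amp_bytes / 2**30:.0f} GiB")
# both come out around ~82 GiB, i.e. roughly the same, as noted above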
Here are the next things to try:
Experiment A. Try deepspeed with both fp16 and bf16 disabled and stage 2 (your current setup) on top of run_translation.py. How does that fare?
Experiment B. Same as A, but use stage 3 in the config file, and make sure your CPU offload is enabled - the default config file from the docs will do (see the sketch after this list).
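For Experiment B, a hedged sketch of what such a config could look like, written as a Python dict for illustration and loosely based on the ds_config_zero3.json example in the transformers DeepSpeed docs; the "auto" values are filled in by the HF Trainer integration, the output filename is hypothetical, and only a subset of keys is shown.
import json

ds_config = {
    "fp16": {"enabled": False},  # Experiments A/B: fp16 and bf16 disabled
    "zero_optimization": {
        "stage": 3,              # stage 3 instead of the current stage 2
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero3_fp32.json", "w") as f:
    json.dump(ds_config, f, indent=4)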
I of course assume you’re also using torch==1.10 and some fairly recent cuda - at least cuda=11.3
re: bf16-support in deepspeed I haven’t tried it myself yet as it was literally just added. I will give it a try.
I have a feeling that the issue is not in using deepspeed but somewhere else in your setup.
Let’s remove deepspeed from the equation for a moment and try your setup on a single GPU with t5-large or even t5-small - make it work first so that it produces what you expect, albeit at lower quality. Once this is working you can progress to a bigger model size and eventually just plug deepspeed in to work with t5-11b.
It’ll also make your debugging process much easier, since it takes forever to even load t5-11b.
Always start small and simple, then progress to bigger and slightly more complex, and then big and complex.
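Along those lines, a minimal single-GPU sanity check with t5-small (no DeepSpeed, no fp16) could look like the sketch below; the example pair is taken from the data format shown earlier, everything else is an assumption.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

source = "Write a poetic sentence about 'lips'"
target = "Her dry lips were tightly closed up."

inputs = tokenizer(source, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(target, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss
print(loss.item())  # should be a normal finite number, not 0.0 or nan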