Deepspeed and T5-11B for multitask training
Carrying on my conversation with @stas00 from https://github.com/huggingface/transformers/issues/9996#issuecomment-968348129.
I used run_translation.py and now my loss is 0.0 😦. This is probably doomed to fail.
{'loss': 7.2639, 'learning_rate': 0.001, 'epoch': 0.02}
3%|████ | 612/24128 [42:13<26:09:12, 4.00s/it]{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.06}
8%|█████████████ | 1999/24128 [2:15:09<24:43:54, 4.02s/it][2021-11-25 22:01:13,181] [INFO] [logging.py:69:log_dist] [Rank 0] step=2000, skipped=1995, lr=[0.001, 0.001], mom=[0.0, 0.0]
[2021-11-25 22:01:13,181] [INFO] [timer.py:181:stop] 0/2000, SamplesPerSec=7.902960485741644
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.08}
{'loss': 0.0, 'learning_rate': 0.001, 'epoch': 0.1}
Script
export BS=8
PYTHONPATH=../../../src USE_TF=0 \
deepspeed --num_gpus=4 ./run_translation.py \
--model_name_or_path t5-11b \
--output_dir /local/nlp/temp/poetryT5-11B_new \
--evaluation_strategy=epoch \
--do_train \
--train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
--save_strategy=epoch \
--label_smoothing 0.1 \
--learning_rate 1e-3 \
--adafactor \
--overwrite_output_dir \
--max_source_length 64 \
--max_target_length 64 \
--num_train_epochs 1 \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--source_lang en \
--target_lang en \
--deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero2.json \
--fp16
Data format
{"translation": {"en1": "Write a poetic sentence about 'people'", "en2": "In this age what people used to call."}}
{"translation": {"en1": "Write a poetic sentence about 'tale'", "en2": "Where evening is empty, an unfinished tale."}}
{"translation": {"en1": "Write a poetic sentence that ends in a word which rhymes with 'planes'", "en2": "Now the blood freezes in the veins."}}
{"translation": {"en1": "Write a poetic sentence about 'Weighs his spread' and ending in 'behold'", "en2": "Weighs his spread wings, at leasure to behold."}}
{"translation": {"en1": "Write a poetic sentence about 'lips'", "en2": "Her dry lips were tightly closed up."}}
def preprocess_function(examples):
    inputs = [ex["en1"] for ex in examples["translation"]]
    targets = [ex["en2"] for ex in examples["translation"]]
    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length" and data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
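For completeness, a hedged sketch of how the function above gets applied, building on the loading sketch earlier; prefix, padding, max_target_length and data_args are stand-ins for what run_translation.py sets up from its command-line arguments, and the values below mirror the command above but are assumptions.
from types import SimpleNamespace
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # t5-11b in the actual run
prefix = ""                                             # no --source_prefix was passed
padding = "max_length"                                  # or False without --pad_to_max_length
max_target_length = 64                                  # matches --max_target_length
data_args = SimpleNamespace(max_source_length=64, ignore_pad_token_for_loss=True)

train_dataset = raw_datasets["train"].map(
    preprocess_function,
    batched=True,
    remove_columns=["translation"],
)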
The first step is to make things work w/o overflow, the second step is dealing with memory.
As bf16 is all new it will take some time to fully sort it out. You can try solution (1) as well - it might just work.
So your fp32 OOM was w/ or w/o deepspeed?
fp32 takes about the same amount of memory as fp16 mixed precision, because the latter still allocates 4 bytes for master weights per param. So the latter saves some memory in some places, but uses more memory in others. fp16 amp is really about up to 5x speed up, not saving memory.
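To make that concrete, here is a rough back-of-envelope comparison; an illustration only, since it ignores optimizer states, activations and framework overhead, and Adafactor's states are smaller than Adam's anyway.
params = 11e9  # ~11B parameters for t5-11b

# plain fp32: fp32 weights + fp32 gradients
fp32_bytes = params * (4 + 4)

# fp16 mixed precision: fp16 weights + fp32 master weights + fp16 gradients
amp_bytes = params * (2 + 4 + 2)

print(f"fp32 weights+grads:     {fp32_bytes / 2**30:.0f} GiB")
print(f"fp16 AMP weights+grads: {amp_bytes / 2**30:.0f} GiB")
# both come out around ~82 GiB, i.e. roughly the same, as noted above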
Here are the next things to try:
Experiment A. Try deepspeed with both fp16 and bf16 disabled and stage 2 (your current setup) on top of run_translation.py. How does that fare?
Experiment B. Same as A, but use stage 3 in the config file, and make sure your CPU offload is enabled - the default config file from the docs will do (see the sketch after this list).
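For Experiment B, a hedged sketch of what such a config could look like, written as a Python dict for illustration and loosely based on the ds_config_zero3.json example in the transformers DeepSpeed docs; the "auto" values are filled in by the HF Trainer integration, the output filename is hypothetical, and only a subset of keys is shown.
import json

ds_config = {
    "fp16": {"enabled": False},  # Experiments A/B: fp16 and bf16 disabled
    "zero_optimization": {
        "stage": 3,              # stage 3 instead of the current stage 2
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero3_fp32.json", "w") as f:
    json.dump(ds_config, f, indent=4)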
I of course assume you’re also using torch==1.10 and some fairly recent cuda - at least cuda=11.3
re: bf16-support in deepspeed I haven’t tried it myself yet as it was literally just added. I will give it a try.
I have a feeling that the issue is not in using deepspeed but somewhere else in your setup.
Let’s remove deepspeed from the equation for a moment and try your setup on a single GPU with t5-large or even t5-small - make it work first so that it produces what you expect, albeit at lower quality. Once this is working you can progress to a bigger model size and eventually just plug deepspeed in to work with t5-11b.
It’ll also make your debugging process much easier, since it takes forever to even load t5-11b.
Always start small and simple, then progress to bigger and slightly more complex, and then big and complex.
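Along those lines, a minimal single-GPU sanity check with t5-small (no DeepSpeed, no fp16) could look like the sketch below; the example pair is taken from the data format shown earlier, everything else is an assumption.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

source = "Write a poetic sentence about 'lips'"
target = "Her dry lips were tightly closed up."

inputs = tokenizer(source, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(target, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss
print(loss.item())  # should be a normal finite number, not 0.0 or nan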