Fine-tuning T5 model
Hi, I want to fine-tune T5 for a seq2seq task and I’m using T5ForConditionalGeneration as it seems to have a language modeling head on top. As there’s no code example for this, I have lots of questions:
- Am I doing the right thing?
- I’m using the Adam optimizer. Is it ok?
- I’m a bit confused about the forward inputs in the training phase. I read this explanation over and over again and I don’t understand whether I should just use input_ids and lm_labels for the training or not. Also, somewhere in this issue someone’s mentioned that:
T5 input sequence should be formatted with [CLS] and [SEP] tokens
So which one is right? I’m super confused.
Issue Analytics
- Created: 3 years ago
- Reactions: 4
- Comments: 33 (21 by maintainers)
Top Results From Across the Web
A Full Guide to Finetuning T5 for Text2Text and Building a ...
In this article, we see a complete example of fine-tuning of T5 for generating candidate titles for articles. The model is fine-tuned ...
Fine Tuning T5 Transformer Model with PyTorch
A T5 is an encoder-decoder model. It converts all NLP problems like language translation, summarization, text generation, question-answering, to ...
Fine Tuning a T5 transformer for any Summarization Task
The T5 tuner is a pytorch lightning class that defines the data loaders, forward pass through the model, training one step, validation on ...
Top 3 Fine-Tuned T5 Transformer Models - Vennify.ai
In this article I'll discuss my top three favourite fine-tuned T5 models that are available on Hugging Face's Model Hub. T5 was published ...
mrm8488/t5-base-finetuned-break_data - Hugging Face
The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @amitness,
For T5 summarization you will have to prepend the prefix "summarize: " to every input. But you are more or less right. All you have to do is:
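A minimal sketch of the idea (the checkpoint name and text strings are placeholders, and the target ids are passed as lm_labels, the argument name in the transformers version this issue refers to; newer releases call the same argument labels and return the loss as outputs.loss):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

article = "the article you want to summarize ..."
summary = "the reference summary ..."

# Prefix the source text with the task prefix and tokenize both sides.
input_ids = tokenizer.encode("summarize: " + article, return_tensors="pt")
lm_labels = tokenizer.encode(summary, return_tensors="pt")

# The model shifts lm_labels internally to build the decoder inputs,
# so input_ids and lm_labels are all that is needed to get the LM loss.
outputs = model(input_ids=input_ids, lm_labels=lm_labels)
loss = outputs[0]
loss.backward()
```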
There is no need to shift the tokens as you show at the end of your comment because T5 does that automatically - see https://github.com/huggingface/transformers/blob/6af3306a1da0322f58861b1fbb62ce5223d97b8a/src/transformers/modeling_t5.py#L1063.
This is also explained in https://huggingface.co/transformers/model_doc/t5.html#training .
@amitness
E.g. in your summarization case, it would look something like:
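A sketch with placeholder strings, writing the right-shift of the target ids out by hand:

```python
import torch
from transformers import T5Tokenizer, T5Model

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5Model.from_pretrained("t5-base")

input_ids = tokenizer.encode("summarize: " + "the article text ...", return_tensors="pt")
target_ids = tokenizer.encode("the reference summary ...", return_tensors="pt")

# Build the decoder inputs by shifting the target ids one position to the right
# and prepending the pad token, which T5 uses as the decoder start token.
pad = torch.full(
    (target_ids.size(0), 1), model.config.decoder_start_token_id, dtype=target_ids.dtype
)
decoder_input_ids = torch.cat([pad, target_ids[:, :-1]], dim=1)

outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
```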
Do note that T5ForConditionalGeneration already prepends the padding token by default. The above is only necessary if you’re doing a forward pass straight from T5Model.

Regarding your question about making your own prefix: yes, you should be able to train on your own prefix. This is the whole point of T5’s text-to-text approach. You should be able to specify any problem through this kind of approach (e.g. Appendix D in the T5 paper).