Longformer Encoder-Decoder (LED)-large model fine-tuning for summarization results in </s><s><s><s><s><s><s><s><s><s><s>... output
System Info
- transformers version: 4.20.0.dev0
- Platform: Linux-4.18.0-348.23.1.el8_5.x86_64-x86_64-with-centos-8.6-Green_Obsidian
- Python version: 3.7.13
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
OUTPUT_DIR=/home/ratish/project
python -m torch.distributed.launch --nproc_per_node=1 examples/pytorch/summarization/run_summarization.py \
--model_name_or_path allenai/led-large-16384 \
--do_train \
--do_eval \
--dataset_name xsum \
--output_dir ${OUTPUT_DIR} \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate \
--logging_dir logs \
--evaluation_strategy steps \
--eval_steps 100 \
--logging_steps 100 \
--report_to tensorboard \
--save_total_limit 5 \
--save_steps 100 \
--load_best_model_at_end \
--greater_is_better True \
--metric_for_best_model rougeL \
--max_eval_samples 100 \
--num_beams 3
The logs show that at checkpoint 1800 the ROUGE scores become zero:
{'eval_loss': 2.172360897064209, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 20.0, 'eval_runtime': 10.2823, 'eval_samples_per_second': 9.725, 'eval_steps_per_second': 2.431, 'epoch': 0.04}
I evaluate the model output using the function below:
def generate_output():
    import torch
    from transformers import LEDTokenizer, LEDForConditionalGeneration

    MODEL = "/home/ratish/checkpoint-1800"
    model = LEDForConditionalGeneration.from_pretrained(MODEL)
    tokenizer = LEDTokenizer.from_pretrained(MODEL)

    ARTICLE_TO_SUMMARIZE = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
    inputs = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors="pt")

    # LED uses local attention by default; mark the first token for global attention.
    global_attention_mask = torch.zeros_like(inputs)
    global_attention_mask[:, 0] = 1

    summary_ids = model.generate(inputs, global_attention_mask=global_attention_mask, num_beams=3, max_length=32)
    # skip_special_tokens=False keeps the degenerate special tokens visible in the output.
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=False, clean_up_tokenization_spaces=False))
It produces the output </s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Expected behavior
The model should produce a summary of the news article.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @ratishsp
As promised, I checked. You are right: perturbing the bos token embedding does not help for the checkpoint allenai/led-large-16384 (well, it helps a bit in the first few iterations, but once the steps continue we get the same </s><s><s> output).
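The thread does not show the perturbation code itself; below is a minimal sketch of what such a diagnostic could look like (the noise scale is an arbitrary assumption), in case someone wants to reproduce the experiment:

import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")
tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384")

# Add small Gaussian noise to the <s> (bos) embedding row before fine-tuning,
# to test whether the degenerate "<s><s>..." generations depend on it.
with torch.no_grad():
    bos_row = model.get_input_embeddings().weight[tokenizer.bos_token_id]
    bos_row += 0.01 * torch.randn_like(bos_row)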
I have run out of ideas; the only thing that works is to avoid using
</s> <s> <tok_1> <tok_2> ...
when preparing labels, and to use just
</s> <tok_1> <tok_2> ...
instead. To do so, add the following block after this line: https://github.com/huggingface/transformers/blob/4dd784c32f76fb8285f205b94e2a6ebde731a1cd/examples/pytorch/summarization/run_summarization.py#L536
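The block itself was lost when this page was scraped; the following is only a sketch of the described change, assuming that labels at that point is the encoding the tokenizer returned for the targets and that each sequence starts with the tokenizer's bos_token_id:

# Drop the leading <s> (bos) token from every label sequence so that the
# shifted decoder inputs become "</s> <tok_1> ..." rather than "</s> <s> <tok_1> ...".
labels["input_ids"] = [
    ids[1:] if ids and ids[0] == tokenizer.bos_token_id else ids
    for ids in labels["input_ids"]
]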
Or you can simply use my branch debug_led_large_bad_generation, which will also save the generations after each evaluation.
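The branch's changes are not reproduced here; a minimal sketch of the same idea (the helper name and hook point are assumptions) would be to call something like this from compute_metrics in run_summarization.py, right after tokenizer.batch_decode:

import os

def save_generations(decoded_preds, output_dir, tag):
    # Hypothetical helper: write one decoded prediction per line so that
    # degenerate outputs such as "</s><s><s>..." are easy to spot.
    path = os.path.join(output_dir, f"generations_{tag}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for pred in decoded_preds:
            f.write(pred.replace("\n", " ") + "\n")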
You can verify the effect with (and without) this change by running a tiny training with very few examples.
Let me know if you can get normal results with this change 🙏 Thank you!
Even more surprising, the LED-base model seems to be doing quite well!
Model output (checkpoint 1600):
</s><s>The Eiffel Tower in Paris is the tallest structure in the world.</s>