Longformer Encoder-Decoder (LED)-large model fine-tuning for summarization results in </s><s><s><s><s><s><s><s><s><s><s>... output
System Info
- transformers version: 4.20.0.dev0
- Platform: Linux-4.18.0-348.23.1.el8_5.x86_64-x86_64-with-centos-8.6-Green_Obsidian
- Python version: 3.7.13
- Huggingface_hub version: 0.7.0
- PyTorch version (GPU?): 1.11.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
OUTPUT_DIR=/home/ratish/project
python -m torch.distributed.launch --nproc_per_node=1 examples/pytorch/summarization/run_summarization.py \
--model_name_or_path allenai/led-large-16384 \
--do_train \
--do_eval \
--dataset_name xsum \
--output_dir ${OUTPUT_DIR} \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate \
--logging_dir logs \
--evaluation_strategy steps \
--eval_steps 100 \
--logging_steps 100 \
--report_to tensorboard \
--save_total_limit 5 \
--save_steps 100 \
--load_best_model_at_end \
--greater_is_better True \
--metric_for_best_model rougeL \
--max_eval_samples 100 \
--num_beams 3
The logs show that at checkpoint 1800 the ROUGE scores become zero:
{'eval_loss': 2.172360897064209, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 20.0, 'eval_runtime': 10.2823, 'eval_samples_per_second': 9.725, 'eval_steps_per_second': 2.431, 'epoch': 0.04}
I evaluate the model output using the function below:
def generate_output():
    import torch
    from transformers import LEDTokenizer, LEDForConditionalGeneration

    MODEL = "/home/ratish/checkpoint-1800"
    model = LEDForConditionalGeneration.from_pretrained(MODEL)
    tokenizer = LEDTokenizer.from_pretrained(MODEL)

    ARTICLE_TO_SUMMARIZE = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
    inputs = tokenizer.encode(ARTICLE_TO_SUMMARIZE, return_tensors="pt")

    # LED uses local attention by default; mark the first token for global attention.
    global_attention_mask = torch.zeros_like(inputs)
    global_attention_mask[:, 0] = 1

    summary_ids = model.generate(inputs, global_attention_mask=global_attention_mask, num_beams=3, max_length=32)
    # skip_special_tokens=False keeps the degenerate special tokens visible in the output.
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=False, clean_up_tokenization_spaces=False))
It produces the output </s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Expected behavior
The model should produce a summary of the news article.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @ratishsp
As promised, I checked. You are right: perturbing the bos token embedding does not help for the checkpoint allenai/led-large-16384 (well, it helps a bit in the first few iterations, but once the steps continue we get the same </s><s><s> output).
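The thread does not show the perturbation code itself; below is a minimal sketch of what such a diagnostic could look like (the noise scale is an arbitrary assumption), in case someone wants to reproduce the experiment:

import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")
tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384")

# Add small Gaussian noise to the <s> (bos) embedding row before fine-tuning,
# to test whether the degenerate "<s><s>..." generations depend on it.
with torch.no_grad():
    bos_row = model.get_input_embeddings().weight[tokenizer.bos_token_id]
    bos_row += 0.01 * torch.randn_like(bos_row)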
I have run out of ideas; the only thing that works is to avoid using
</s> <s> <tok_1> <tok_2> ...
when preparing labels, and to use just
</s> <tok_1> <tok_2> ...
instead. To do so, add the following block after this line: https://github.com/huggingface/transformers/blob/4dd784c32f76fb8285f205b94e2a6ebde731a1cd/examples/pytorch/summarization/run_summarization.py#L536
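The block itself was lost when this page was scraped; the following is only a sketch of the described change, assuming that labels at that point is the encoding the tokenizer returned for the targets and that each sequence starts with the tokenizer's bos_token_id:

# Drop the leading <s> (bos) token from every label sequence so that the
# shifted decoder inputs become "</s> <tok_1> ..." rather than "</s> <s> <tok_1> ...".
labels["input_ids"] = [
    ids[1:] if ids and ids[0] == tokenizer.bos_token_id else ids
    for ids in labels["input_ids"]
]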
Or you can simply use my branch debug_led_large_bad_generation, which will also save the generations after each evaluation.
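The branch's changes are not reproduced here; a minimal sketch of the same idea (the helper name and hook point are assumptions) would be to call something like this from compute_metrics in run_summarization.py, right after tokenizer.batch_decode:

import os

def save_generations(decoded_preds, output_dir, tag):
    # Hypothetical helper: write one decoded prediction per line so that
    # degenerate outputs such as "</s><s><s>..." are easy to spot.
    path = os.path.join(output_dir, f"generations_{tag}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for pred in decoded_preds:
            f.write(pred.replace("\n", " ") + "\n")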
You can verify the effect with (and without) this change by running a tiny training with very few examples.
Let me know if you can get normal results with this change 🙏 Thank you!
Even more surprising, the LED-base model seems to be doing quite well!
Model output (checkpoint 1600):
</s><s>The Eiffel Tower in Paris is the tallest structure in the world.</s>