gpt2-medium fine-tuned model.generate joins words and sentences together without space or newline
Hi,
I have successfully fine-tuned a gpt2 model and used it to generate text. My training corpus consists of short sentences (3-5 words) and longer ones (10-15 words), all separated by a newline character, sometimes ending with [ . ! ? ] and sometimes not.
outputs = model.generate(
    input_ids=input_ids,
    max_length=max_length,
    temperature=temperature,
    repetition_penalty=repetition_penalty,
    bos_token_id=tokenizer.bos_token_id,
    top_k=top_k,
    top_p=top_p,
)
ret = tokenizer.decode(outputs[0], skip_special_tokens=True)
Then I fine-tuned a gpt2-medium model. The training corpus was slightly different, but structured the same as described above.
I had to use --fp16 and --block_size=512 to fit within the GPU memory limit.
The result: using the fine-tuned gpt2-medium model, I am experiencing a couple of issues:
- I get frequent issues where lines or words are stuck together, without any newline or space, e.g. word1Word2Word3 or: line 1 with some words!Another line with some words™️Next line… (see the round-trip check after this list)
- I get a warning: "Setting pad_token_id to 50256 (first eos_token_id) to generate sequence"
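A quick way to narrow down the first issue is to check whether the tokenizer itself preserves newlines on an encode/decode round trip. A minimal sketch, assuming the stock gpt2-medium tokenizer:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

sample = "Short line one.\nAnother line!\nThird line"
ids = tokenizer.encode(sample)
roundtrip = tokenizer.decode(ids)

# If this prints True, newlines survive tokenization, and the missing
# separators most likely come from the fine-tuned model itself (e.g.
# newlines were lost when the training examples were built).
print(roundtrip == sample)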
I’ve tried playing with the decode parameters with no luck:
ret = tokenizer.decode(outputs[0], skip_special_tokens=False, clean_up_tokenization_spaces=False)
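If the round-trip check above succeeds, the decode step is probably not at fault. A training-side workaround, sketched here under the assumption of a plain one-sentence-per-line corpus (corpus.txt and corpus_with_eos.txt are hypothetical file names, not from the issue), is to append an explicit end-of-text marker to every line before fine-tuning, so the model sees an unambiguous sentence boundary:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

# corpus.txt: hypothetical input file, one sentence per line.
with open("corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# corpus_with_eos.txt: hypothetical output file for the fine-tuning
# script; each sentence now ends with the tokenizer's EOS token.
with open("corpus_with_eos.txt", "w", encoding="utf-8") as f:
    for line in lines:
        f.write(line + tokenizer.eos_token + "\n")

With explicit boundary markers in the training data, generation is more likely to break cleanly between sentences instead of running lines together.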
Help appreciated,
thanks in advance, Albert
Top GitHub Comments
Regarding the second question:
It's explained here: "For open-end generation, HuggingFace will set the padding token ID to be equal to the end-of-sentence token ID". The code is here: https://github.com/huggingface/transformers/blob/b880508440f43f80e35a78ccd2a32f3bde91cb23/src/transformers/generation_utils.py#L410-L414
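In practice, the warning can be silenced by passing the padding token explicitly. A minimal sketch, reusing the generate call from the question:

# Passing pad_token_id explicitly means generate() no longer has to
# fall back to eos_token_id on its own, so the warning disappears.
outputs = model.generate(
    input_ids=input_ids,
    max_length=max_length,
    pad_token_id=tokenizer.eos_token_id,
)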
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.