gpt2-medium fine-tuned model.generate joins words and sentences together without space or newline
Hi,
I have successfully fine-tuned a gpt2 model and used it to generate text. My training corpus consists of short sentences (3-5 words) and longer ones (10-15 words), all separated by a newline character, sometimes ending with [ . ! ? ] and sometimes not.
outputs = model.generate(
    input_ids=input_ids,
    max_length=max_length,
    temperature=temperature,
    repetition_penalty=repetition_penalty,
    bos_token_id=tokenizer.bos_token_id,
    top_k=top_k,
    top_p=top_p,
)
ret = tokenizer.decode(outputs[0], skip_special_tokens=True)
Then I fine-tuned a gpt2-medium model. The training corpus was slightly different, but structured the same as described above.
I had to use --fp16 and --block_size=512 to fit within the GPU memory limit.
The result: using the fine-tuned gpt2-medium model, I am experiencing a couple of issues:
- I get frequent issues where lines or words are stuck together, without any newline or space, e.g. word1Word2Word3 or: line 1 with some words!Another line with some words™️Next line… (see the round-trip check after this list)
- I get a warning: "Setting pad_token_id to 50256 (first eos_token_id) to generate sequence"
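A quick way to narrow down the first issue is to check whether the tokenizer itself preserves newlines on an encode/decode round trip. A minimal sketch, assuming the stock gpt2-medium tokenizer:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

sample = "Short line one.\nAnother line!\nThird line"
ids = tokenizer.encode(sample)
roundtrip = tokenizer.decode(ids)

# If this prints True, newlines survive tokenization, and the missing
# separators most likely come from the fine-tuned model itself (e.g.
# newlines were lost when the training examples were built).
print(roundtrip == sample)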
I’ve tried playing with the decode parameters with no luck:
ret = tokenizer.decode(outputs[0], skip_special_tokens=False, clean_up_tokenization_spaces=False)
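If the round-trip check above succeeds, the decode step is probably not at fault. A training-side workaround, sketched here under the assumption of a plain one-sentence-per-line corpus (corpus.txt and corpus_with_eos.txt are hypothetical file names, not from the issue), is to append an explicit end-of-text marker to every line before fine-tuning, so the model sees an unambiguous sentence boundary:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

# corpus.txt: hypothetical input file, one sentence per line.
with open("corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

# corpus_with_eos.txt: hypothetical output file for the fine-tuning
# script; each sentence now ends with the tokenizer's EOS token.
with open("corpus_with_eos.txt", "w", encoding="utf-8") as f:
    for line in lines:
        f.write(line + tokenizer.eos_token + "\n")

With explicit boundary markers in the training data, generation is more likely to break cleanly between sentences instead of running lines together.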
Help appreciated,
thanks in advance, Albert
Top GitHub Comments
Regarding the second question:
It's explained here: "For open-end generation, HuggingFace will set the padding token ID to be equal to the end-of-sentence token ID". The code is here: https://github.com/huggingface/transformers/blob/b880508440f43f80e35a78ccd2a32f3bde91cb23/src/transformers/generation_utils.py#L410-L414
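In practice, the warning can be silenced by passing the padding token explicitly. A minimal sketch, reusing the generate call from the question:

# Passing pad_token_id explicitly means generate() no longer has to
# fall back to eos_token_id on its own, so the warning disappears.
outputs = model.generate(
    input_ids=input_ids,
    max_length=max_length,
    pad_token_id=tokenizer.eos_token_id,
)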
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.