Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

gpt2-medium fine-tuned model.generate joins words and sentences together without space or newline

See original GitHub issue

Hi,

I have successfully fine-tuned a gpt2 model and used it to generate text. My training corpus consists of short sentences (3-5 words) and longer ones (10-15 words), all separated by a newline character. Some lines end with [ . ! ? ] and some do not.

    outputs = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        bos_token_id=tokenizer.bos_token_id,
        top_k=top_k,
        top_p=top_p,
    )

    ret = tokenizer.decode(outputs[0], skip_special_tokens=True)
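
As a quick sanity check (not part of the original report), GPT-2's byte-level BPE round-trips newlines and spaces losslessly, so the tokenizer and decode() can be ruled out as the source of the missing separators:

    # Sanity check (assumed, minimal setup): byte-level BPE encodes "\n"
    # and spaces losslessly, so decode() returns exactly what was encoded.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2-medium")
    sample = "Short line one.\nA longer second line with a few more words\n"
    ids = tok.encode(sample)
    assert tok.decode(ids, clean_up_tokenization_spaces=False) == sample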

Then I fine-tuned a gpt2-medium model. The training corpus was slightly different, but structured the same as described above.

I had to use --fp16 and --block_size=512 to fit within GPU memory limits.

The result: using the fine-tuned gpt2-medium model, I am experiencing a couple of issues:

  1. Words and lines are frequently stuck together, with no newline or space between them. For example: word1Word2Word3, or: line 1 with some words!Another line with some words™️Next line… (see the sketch after this list for one plausible cause)

  2. I get a ‘warning’: `Setting pad_token_id to 50256 (first eos_token_id) to generate sequence`
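
One plausible cause of the first symptom (an assumption, not something confirmed in this thread): if the corpus was preprocessed line by line, as the example scripts' LineByLineTextDataset does under --line_by_line, the newline characters are stripped before tokenization, so the model never sees a "\n" token during training and cannot learn to emit one. A minimal sketch of the effect:

    # Minimal sketch: splitting the corpus into lines (as line-by-line
    # preprocessing does) discards the "\n" separators before tokenization.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2-medium")
    corpus = "First short line!\nA second line with a few more words\n"
    lines = corpus.splitlines()                 # the "\n" separators are gone
    ids_per_line = [tok.encode(line) for line in lines]
    newline_id = tok.encode("\n")[0]            # 198, the byte-level "Ċ" token
    assert all(newline_id not in ids for ids in ids_per_line)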

I’ve tried playing with the decode parameters, with no luck:

    ret = tokenizer.decode(outputs[0], skip_special_tokens=False, clean_up_tokenization_spaces=False)
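
Since decoding is lossless for GPT-2, no combination of decode() flags can restore separators that are absent from the generated ids. A diagnostic worth trying (a suggestion, not taken from the thread) is to inspect the raw tokens:

    # Diagnostic sketch: look at the generated tokens directly. GPT-2
    # renders "\n" as the byte-level token "Ċ" and a leading space as "Ġ";
    # if no "Ċ" appears, the model is simply not generating newlines.
    tokens = tokenizer.convert_ids_to_tokens(outputs[0].tolist())
    print(tokens[:50])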

Help appreciated,

thanks in advance, Albert

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
TheresaSchmidt commented, Mar 8, 2021

Regarding the second question

2. I get a 'warning':
   Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence

It’s explained here: “For open-end generation, HuggingFace will set the padding token ID to be equal to the end-of-sentence token ID”. Code is here: https://github.com/huggingface/transformers/blob/b880508440f43f80e35a78ccd2a32f3bde91cb23/src/transformers/generation_utils.py#L410-L414
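
To silence the message (a standard approach, not quoted from this thread), pad_token_id can be passed explicitly:

    # Passing pad_token_id explicitly suppresses the informational message;
    # for open-ended GPT-2 generation, the EOS id (50256) is the usual choice.
    outputs = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
    )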

0 reactions
stale[bot] commented, Aug 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


Top Results From Across the Web

How to Fine-Tune GPT-2 for Text Generation
The idea is to use the already trained model, fine-tune it to our specific data and then, based on what the model observes,...

Text generation with GPT-2 - Model Differently
In this post we will see how to generate text with models based on the Transformers architecture, and we will use this knowledge...

arXiv:2101.00027v1 [cs.CL] 31 Dec 2020
The Pile: An 800GB Dataset of Diverse Text for Language Modeling ... sentence into words and computed the percentage.

Few-Shot Learning with Language Models
ways of injecting new embeddings into this existing vector space. Accordingly, the representations we create for new words must not only ...

Exploiting Latent Features of Text and Graphs - TigerPrints
are objects a and c, which we queried to create these topic models. ... ing and embedding a large semantic graph built around...
