
Getting special characters in model generation

See original GitHub issue

Hello. I fine-tuned my DialoGPT-small model on the DailyDialog dataset. When speaking with the model, the last generation of each round of conversation produces output full of random punctuation characters. This is my generation code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
model = AutoModelForCausalLM.from_pretrained('output-daily')

raw = ""
while raw != "END":
  for step in range(4):
    raw = input(">> User:")
    if raw == "END":
      break

    # encode the user input, terminated with the EOS token
    new_user_input_ids = tokenizer.encode(raw + tokenizer.eos_token, return_tensors='pt')

    # append the new turn to the chat history (the first turn has no history)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=150,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.92,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id
    )
    # decode and print only the tokens generated after the input
    print("Bot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
  # the history is dropped every 4 turns, since step resets to 0
  print("\n")

The output:

>> User:Hello, what are you doing?
Bot: I'm just watching TV.
>> User:What is your favorite movie?
Bot: My favorite movie is Superbad.
>> User:I don't know it.
Bot: I like comedies best.
>> User:Me too!
Bot: !!!( At the end of the movie, O. K. )!!!!!1!!@!!excuse me!!( at the end, O'K. )!!excuses me!!Excuse me!(At the end )!exc!!automatics!!remotes!!Automatics!rematics!autos!!Remotes!autop!!Automatics!!autoblog!!log!autobi


>> User:Me too!
Bot: What do you think of this dress?
>> User:It's amazing!
Bot: I can't believe it.
>> User:Yes I lie.
Bot: What does it look like?
>> User:A potato.
Bot: !!!( After trying )!!!!!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
archmagos-dominus commented, Sep 21, 2022

Since it breaks after step 3/4, a potential hacky solution is to maintain a fixed-length queue (of, say, 3 rounds) that stores past inputs and outputs, and use that instead of the whole history. Some context is lost, but this would allow the chatbot to run endlessly without breaking down, keeping some context rather than none (as with step hardcoded to 0).
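The fixed-length queue idea above could be sketched as follows. This is a minimal illustration, not code from the thread: token ids are shown as plain Python lists rather than torch tensors, and the round contents are made up.

```python
from collections import deque

# Keep only the last 3 (user, bot) rounds instead of the full chat history.
history = deque(maxlen=3)  # the oldest round is evicted automatically

def build_input(new_user_ids):
    """Concatenate the retained rounds plus the new user turn."""
    context = [tok for round_ids in history for tok in round_ids]
    return context + new_user_ids

# simulate four rounds: after the fourth, round 1 has been evicted
for round_no in range(1, 5):
    user_ids = [round_no * 10]      # fake user token ids
    bot_ids = [round_no * 10 + 1]   # fake bot reply token ids
    model_input = build_input(user_ids)
    history.append(user_ids + bot_ids)

print(len(history))       # 3 — only the last three rounds are kept
print(list(history)[0])   # [20, 21] — round 1 ([10, 11]) was evicted
```

In the real script, the stored items would be the per-round tensors and `build_input` would use `torch.cat` instead of list concatenation.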

Changing the number of chat rounds kept in memory proved to solve the issue most of the time; however, it was not as reliable as I needed it to be. As per my response on Sep 17, I instead took the length of the tensor into account, and using a 'hacky' fix like the one below I was able to get it working without freaking out at all.

    if bot_input_ids.size(dim=1) >= args.get('max_length'):
        # trim the history to the last max_length tokens
        bot_input_ids = torch.narrow(bot_input_ids, 1, -args.get('max_length'), args.get('max_length'))

When you say EoS is not added, is there a way to add it manually? Like, if we add EoS after every response, would that fix the issue?

To be absolutely honest, I did not pursue this line of thinking, since I managed to get it working well enough for my implementation. Whether adding EoS manually would make it behave properly, I do not know.
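For what it's worth, a minimal sketch of what "adding EoS manually" could look like, assuming the history is a plain list of token ids. In the real script the history is a tensor and the id would come from `tokenizer.eos_token_id` (50256, GPT-2's `<|endoftext|>`, for DialoGPT):

```python
# Hypothetical sketch: after each generated reply, make sure the history
# ends with the EOS token id before starting the next round.
EOS_ID = 50256  # tokenizer.eos_token_id for microsoft/DialoGPT-small

def ensure_eos(history_ids, eos_id=EOS_ID):
    """Append eos_id only if the history does not already end with it."""
    if not history_ids or history_ids[-1] != eos_id:
        return history_ids + [eos_id]
    return history_ids

print(ensure_eos([11, 22]))          # [11, 22, 50256]
print(ensure_eos([11, 22, 50256]))   # unchanged: [11, 22, 50256]
```

Whether this actually cures the "!!!" degeneration is untested here; it only shows the mechanics of the idea.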

0 reactions
Nakul24-1 commented, Sep 18, 2022

Yeah, I have encountered the same issues. The model just returns tens of "!!!" and then cannot be conversed with anymore. This behaviour happens after the 4th round of the conversation, like clockwork. The problem seems to stem from the implementation of the chat history: with step hardcoded to a constant 0, the bot works, albeit without any memory, but as step reaches 3 everything just breaks down. Maybe it's a dataset issue, or maybe it is some sort of memory issue. EDIT: It seems that after a few rounds, the EoS token that should end the round is no longer added after the bot response.

Did you solve it?


I did not manage to figure out the root cause of the problem, but I did manage to make the bot respond as it should by constraining the length of chat_history_ids to a maximum of 50 tokens in my case. It no longer freaks out, but it is also quite limited when it comes to generating responses that take conversation context into account. I hope this bandaid fix works well enough for your implementation as well.
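The length constraint described above can be sketched like this. Plain lists stand in for tensors; with a tensor, the equivalent slice would be `chat_history_ids[:, -50:]`.

```python
# Hypothetical sketch of the bandaid fix: keep only the last MAX_HISTORY
# token ids of the chat history, dropping the oldest context first.
MAX_HISTORY = 50

def trim_history(history_ids, max_len=MAX_HISTORY):
    """Drop the oldest tokens once the history exceeds max_len."""
    return history_ids[-max_len:] if len(history_ids) > max_len else history_ids

long_history = list(range(120))
print(len(trim_history(long_history)))   # 50
print(trim_history(long_history)[0])     # 70 — the oldest tokens are gone
```

The trade-off is exactly the one described: generation stays stable, but any context older than 50 tokens is invisible to the model.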

When you say EoS is not added, is there a way to add it manually? Like after every response we add EoS , would that fix the issue?
