Generate text without <unk> tokens

Hello,

I’m trying to generate text that does not include <unk> tokens. When I run:

python generate.py generate/data-bin/dummy --path \
    checkpoints/checkpoint_best.pt --batch-size 32 --beam 1 \
    --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1 \
    --replace-unk REPLACE_UNK

I get the following error:

Traceback (most recent call last):
  File "generate.py", line 171, in <module>
    main(args)
  File "generate.py", line 25, in main
    '--replace-unk requires a raw text dataset (--raw-text)'
AssertionError: --replace-unk requires a raw text dataset (--raw-text)

But when I add the --raw-text argument, the script seems to assume that this is a translation task rather than a text generation task:

Traceback (most recent call last):
  File "generate.py", line 171, in <module>
    main(args)
  File "generate.py", line 34, in main
    task = tasks.setup_task(args)
  File "/home/edb2129/fairseq/fairseq/tasks/__init__.py", line 19, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args)
  File "/home/edb2129/fairseq/fairseq/tasks/translation.py", line 83, in setup_task
    raise Exception('Could not infer language pair, please provide it explicitly')
Exception: Could not infer language pair, please provide it explicitly

Is there a way to generate text from writing prompts without <unk> tokens?
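
One generic way to keep <unk> out of sampled text is to mask its score before sampling, so that it can never be drawn. The sketch below illustrates that idea in plain Python with top-k sampling and a temperature, mirroring the --sampling-topk and --sampling-temperature options in the command above. It is not fairseq's implementation: the function name `sample_without_unk`, the toy vocabulary, and the scores are all hypothetical.

```python
import math
import random


def sample_without_unk(logits, unk_index, top_k=10, temperature=0.8, rng=random):
    """Sample one token index with top-k sampling, never returning unk_index."""
    # Mask the unknown token so it cannot be selected.
    masked = [
        -math.inf if i == unk_index else score
        for i, score in enumerate(logits)
    ]
    # Keep the k highest-scoring candidates that are still allowed.
    candidates = sorted(
        (i for i, score in enumerate(masked) if score != -math.inf),
        key=lambda i: masked[i],
        reverse=True,
    )[:top_k]
    if not candidates:
        raise ValueError("no candidate tokens left after masking <unk>")
    # Temperature-scaled softmax weights over the surviving candidates.
    scaled = [masked[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical toy vocabulary and scores, for illustration only.
vocab = ["<unk>", "the", "dragon", "slept", "."]
logits = [5.0, 2.0, 1.5, 1.0, 0.5]  # <unk> has the highest raw score
token = vocab[sample_without_unk(logits, unk_index=0, top_k=3)]
```

In a real model the same masking would have to be applied to the output scores at every decoding step, before the sampling step picks a token.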

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
Crista23 commented, Apr 19, 2021

@bhardwaj1230 I have the same problem as you, did you find a fix for it? Thanks!

1 reaction
bhardwaj1230 commented, Sep 18, 2019

Hello,

I am facing the same issue. When I use the --raw-text argument, it says “FileNotFoundError: Dataset not found: test”, but I have the required files in the folder: dict.en.txt, dict.fr.txt, test.en-fr.en.bin, test.en-fr.en.idx, test.en-fr.fr.bin, test.en-fr.fr.idx.
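
The listing above contains only .bin/.idx files for the test split. A quick way to see which form of each split file is actually present is sketched below. It assumes that a raw-text dataset would be stored as plain files named like test.en-fr.en with no extension, and that the .bin/.idx pairs are the binarized form. That naming is inferred from the listing and is not confirmed anywhere in this thread; the helper name `describe_split` is hypothetical.

```python
import os


def describe_split(data_dir, split, src, tgt):
    """Report, per language, whether raw and/or binarized files exist."""
    report = {}
    for lang in (src, tgt):
        prefix = os.path.join(data_dir, f"{split}.{src}-{tgt}.{lang}")
        report[lang] = {
            # Assumed raw-text form: the bare prefix with no extension.
            "raw": os.path.isfile(prefix),
            # Binarized form: both the .bin and the .idx file are present.
            "binarized": all(
                os.path.isfile(prefix + ext) for ext in (".bin", ".idx")
            ),
        }
    return report
```

Calling `describe_split(data_dir, "test", "en", "fr")` on the folder described above would be expected to report binarized files but no raw ones, which may be why a raw-text loader cannot find the split.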