--replace-unk causes bugs with fairseq-interactive
🐛 Bug
When using fairseq-interactive to generate translations, the --replace-unk
argument causes several bugs.
- The alignments are given as tuples, but the function apparently expects a list of indices of the aligned source tokens.
- When no alignment file is given, the default placeholder value ‘@@ ’ causes the alignment file loader to break.
- Finally, when the out-of-vocabulary (OOV) word in the hypothesis is also OOV in the source dictionary, you still get an <unk> in your translation. In this case, I think the original input should be used to replace the <unk> in the translation.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
python fairseq-interactive.py fairseq-data-bin-10752
--path models/transformer_iwslt_de_en_10752-align/checkpoint_best.pt
--beam 5 --source-lang nl
--target-lang ql
--print-alignment --replace-unk
--tokenizer moses
input text: legal name of allianz
Allianz is an OOV word for my task.
For 1.:
Traceback (most recent call last):
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 318, in <module>
cli_main()
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 314, in cli_main
distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 267, in main
hypo_tokens, hypo_str, alignment = utils.post_process_prediction(
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 246, in post_process_prediction
hypo_str = replace_unk(hypo_str, src_str, alignment, align_dict,
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 222, in replace_unk
src_token = src_tokens[alignment[i]]
TypeError: list indices must be integers or slices, not tuple
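The traceback suggests that `alignment[i]` is a tuple rather than an integer. A minimal sketch of one possible fix is to flatten the pairs into one source index per hypothesis position before indexing. It assumes the alignment arrives as a list of `(source_index, target_index)` pairs; the issue does not state the tuple layout, so this is an assumption. The helper name `pairs_to_source_indices` is hypothetical and not part of fairseq.

```python
from typing import List, Optional, Sequence, Tuple


def pairs_to_source_indices(
    pairs: Sequence[Tuple[int, int]], num_target_tokens: int
) -> List[Optional[int]]:
    """Turn (source_index, target_index) pairs into one source index per target token.

    Assumption: each pair is (source_index, target_index). Target positions
    without a pair stay None. If a target position appears more than once,
    the first pair wins.
    """
    result: List[Optional[int]] = [None] * num_target_tokens
    for src_idx, tgt_idx in pairs:
        if 0 <= tgt_idx < num_target_tokens and result[tgt_idx] is None:
            result[tgt_idx] = src_idx
    return result


# Hypothetical alignment for a three-token hypothesis.
print(pairs_to_source_indices([(0, 0), (3, 1), (2, 2)], num_target_tokens=3))
# prints [0, 3, 2]
```

The resulting list can be indexed with an integer position, which is what the failing line `src_tokens[alignment[i]]` appears to expect.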
For 2.:
Traceback (most recent call last):
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 318, in <module>
cli_main()
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 314, in cli_main
distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 191, in main
align_dict = utils.load_align_dict(cfg.generation.replace_unk)
File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 164, in load_align_dict
with open(replace_unk, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '@@ '
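The second traceback shows `open` being called on the placeholder string `'@@ '`. One possible workaround, sketched here as a standalone function rather than fairseq's actual `load_align_dict`, is to read a file only when the value names an existing file and otherwise return an empty dictionary, so that unknown words can still be copied from the source. The name `load_align_dict_or_empty` and the two-column file format are assumptions made for this illustration.

```python
import os
from typing import Dict, Optional


def load_align_dict_or_empty(replace_unk: Optional[str]) -> Optional[Dict[str, str]]:
    """Return None when replacement is off, else a (possibly empty) word mapping."""
    if replace_unk is None:
        return None  # unknown-word replacement was not requested
    align_dict: Dict[str, str] = {}
    # Only open the value if it really is a file; a placeholder such as "@@ "
    # falls through and yields an empty dictionary.
    if isinstance(replace_unk, str) and os.path.isfile(replace_unk):
        with open(replace_unk, "r", encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if len(cols) >= 2:  # assumed format: "<source word> <target word>"
                    align_dict[cols[0]] = cols[1]
    return align_dict


print(load_align_dict_or_empty("@@ "))
```

An empty dictionary then means "copy the aligned source word as-is", while `None` keeps replacement disabled.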
Expected behavior
Replace the <unk> in the hypothesis with the corresponding word in the input according to the alignments. This should also be possible without an alignment dictionary.
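This behavior can be sketched as a standalone function. It is an illustration under stated assumptions, not fairseq's `replace_unk`: it assumes one source index per hypothesis token (or `None`), whitespace-tokenized strings, and that the raw input has the same token positions as the dictionary-encoded source. The parameter `raw_src_str` and the function name are hypothetical.

```python
from typing import Dict, Optional, Sequence


def replace_unk_with_source(
    hypo_str: str,
    raw_src_str: str,
    alignment: Sequence[Optional[int]],
    align_dict: Dict[str, str],
    unk: str = "<unk>",
) -> str:
    """Replace each unk in the hypothesis with its aligned raw source token."""
    hypo_tokens = hypo_str.split()
    # Use the raw input, so a source word that is itself OOV is still
    # available as text instead of having been mapped to <unk>.
    src_tokens = raw_src_str.split()
    for i, token in enumerate(hypo_tokens):
        if token != unk or i >= len(alignment):
            continue
        src_idx = alignment[i]
        if src_idx is None or not 0 <= src_idx < len(src_tokens):
            continue  # no usable alignment: leave the unk in place
        src_token = src_tokens[src_idx]
        # Prefer a dictionary entry; otherwise copy the source word.
        hypo_tokens[i] = align_dict.get(src_token, src_token)
    return " ".join(hypo_tokens)


# Hypothetical hypothesis and alignment for the input from the issue.
print(replace_unk_with_source(
    "legal name of <unk>", "legal name of allianz", [0, 1, 2, 3], {}
))
# prints: legal name of allianz
```

Passing an empty `align_dict` covers the case without an alignment dictionary: the aligned source word is copied directly.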
I made a fix for 1., found a workaround for 2., and added some code to implement the feature described in 3.
I can provide a PR if desired.
Environment
- fairseq Version (e.g., 1.0 or master): master, ‘1.0.0a0+2429317’
- PyTorch Version (e.g., 1.0): 1.8.1
- OS (e.g., Linux): macOS 11.2.3
- How you installed fairseq (pip, source): CFLAGS="-stdlib=libc++" pip install --editable ./
- Build command you used (if compiling from source):
- Python version: 3.8.8
- CUDA/cuDNN version: -
- GPU models and configuration: -
- Any other relevant information:
Additional context
Top GitHub Comments
Hi, I found a solution for the problems described in the issue. It can be found on my personal fork of fairseq: https://github.com/jm-glowienke/fairseq
Unfortunately, I cannot help you any further, as I only worked on this for my thesis almost 2 years ago.
Are you also applying this to a transformer model?
This post explains a bit about why their -replace-unk option does not work for the transformer model: https://forum.opennmt.net/t/translate-py-with-replace-unk-option-and-the-transformer-model/2646
It might be helpful.
[Update on Dec 04, 2022] My task was doing spelling correction, and I was trying to skip all the special characters to unk. I used an alternative way to achieve that: