
--replace-unk causes bugs with fairseq-interactive


🐛 Bug

When using fairseq-interactive to generate translations, the --replace-unk argument causes several bugs.

1. The alignments are given as tuples, but the function apparently expects just a list of indices of the aligned source tokens.
2. When no alignment file is given, the standard input configuration ‘@@’ causes the alignment file loader to break.
3. Finally, when the out-of-vocabulary (OOV) word in the hypothesis is also OOV in the source dictionary, you still get an <unk> in your translation. In this case, I think it would be good to use the original input to replace the <unk> in the translation.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

python fairseq-interactive.py fairseq-data-bin-10752 \
    --path models/transformer_iwslt_de_en_10752-align/checkpoint_best.pt \
    --beam 5 --source-lang nl \
    --target-lang ql \
    --print-alignment --replace-unk \
    --tokenizer moses

Input text: legal name of allianz

Allianz is an OOV word for my task.

For 1.:

Traceback (most recent call last):
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 318, in <module>
    cli_main()
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 314, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 267, in main
    hypo_tokens, hypo_str, alignment = utils.post_process_prediction(
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 246, in post_process_prediction
    hypo_str = replace_unk(hypo_str, src_str, alignment, align_dict,
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 222, in replace_unk
    src_token = src_tokens[alignment[i]]
TypeError: list indices must be integers or slices, not tuple
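
The traceback above suggests that each alignment entry is a tuple rather than a single integer. A minimal sketch of one way to bridge that gap is shown below. It is not the fix from the fork and not fairseq code: the helper name `alignment_pairs_to_indices` is hypothetical, and it assumes each tuple is a (source index, target index) pair.

```python
from typing import Iterable, List, Tuple


def alignment_pairs_to_indices(
    pairs: Iterable[Tuple[int, int]], tgt_len: int, default: int = 0
) -> List[int]:
    """Turn (src_idx, tgt_idx) pairs into one source index per target position.

    Assumption: pairs are ordered (source, target). Target positions without
    any pair fall back to `default`.
    """
    indices = [default] * tgt_len
    seen = set()
    for src_idx, tgt_idx in pairs:
        # Keep only the first source index seen for each target position.
        if 0 <= tgt_idx < tgt_len and tgt_idx not in seen:
            indices[tgt_idx] = src_idx
            seen.add(tgt_idx)
    return indices


src_tokens = ["legal", "name", "of", "allianz"]
hypo_tokens = ["legal", "name", "of", "<unk>"]
pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]

indices = alignment_pairs_to_indices(pairs, len(hypo_tokens))
# Indexing with an int now works where indexing with a tuple raised TypeError.
aligned_src_token = src_tokens[indices[3]]
print(indices, aligned_src_token)
```

With a flat list like this, a lookup of the form `src_tokens[alignment[i]]` receives an integer index instead of a tuple.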

For 2.:

Traceback (most recent call last):
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 318, in <module>
    cli_main()
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 314, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 191, in main
    align_dict = utils.load_align_dict(cfg.generation.replace_unk)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 164, in load_align_dict
    with open(replace_unk, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '@@ '
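
The second traceback shows the loader trying to open the string '@@ ' as a file. One possible guard, sketched below, is to read an alignment dictionary only when the value actually points to an existing file. The function name `load_align_dict_tolerant` is hypothetical and this is not the workaround from the fork; the two-column file layout is an assumption.

```python
import os
from typing import Dict, Optional, Union


def load_align_dict_tolerant(
    replace_unk: Union[None, bool, str]
) -> Optional[Dict[str, str]]:
    """Load an alignment dictionary, tolerating values that are not file paths."""
    if replace_unk is None or replace_unk is False:
        # Unknown-word replacement was not requested.
        return None
    if isinstance(replace_unk, str) and os.path.isfile(replace_unk):
        align_dict = {}
        with open(replace_unk, "r", encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                # Assumed layout: "<source word> <replacement>" per line.
                if len(cols) >= 2:
                    align_dict[cols[0]] = cols[1]
        return align_dict
    # Replacement requested without a usable file (e.g. '@@ '):
    # an empty dictionary means "copy the aligned source word as-is".
    return {}
```

Under this sketch, `load_align_dict_tolerant("@@ ")` returns an empty dictionary instead of raising FileNotFoundError, provided no file with that name exists.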

Expected behavior

Replace the <unk> in the hypothesis by the corresponding word in the input according to the alignments. This should also be possible without an alignment dictionary.
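
The expected behavior can be sketched as follows. This is an illustration under assumptions, not the code from the fork: the function name `replace_unk_from_raw_input` is hypothetical, the alignment is assumed to be a list with one source index per hypothesis token, and the raw (pre-dictionary) input tokens are assumed to be available, so a source-side OOV word is copied instead of a second <unk>.

```python
from typing import Dict, List, Optional


def replace_unk_from_raw_input(
    hypo_tokens: List[str],
    raw_src_tokens: List[str],
    alignment: List[int],
    align_dict: Optional[Dict[str, str]] = None,
    unk: str = "<unk>",
) -> List[str]:
    """Replace each <unk> with the aligned word from the raw input."""
    align_dict = align_dict or {}
    out = []
    for i, tok in enumerate(hypo_tokens):
        has_alignment = i < len(alignment) and 0 <= alignment[i] < len(raw_src_tokens)
        if tok == unk and has_alignment:
            src_token = raw_src_tokens[alignment[i]]
            # Use the dictionary entry if present, else copy the raw word.
            out.append(align_dict.get(src_token, src_token))
        else:
            out.append(tok)
    return out


raw_src = "legal name of allianz".split()
hypo = ["legal", "name", "of", "<unk>"]
result = replace_unk_from_raw_input(hypo, raw_src, alignment=[0, 1, 2, 3])
print(" ".join(result))
```

Passing an empty or missing alignment dictionary simply copies the aligned input word, which matches the request that replacement should also work without a dictionary.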

I made a fix for 1., found a workaround for 2., and added some code to include the feature described in 3.

I can provide a PR if desired.

Environment

fairseq Version (e.g., 1.0 or master): master, ‘1.0.0a0+2429317’
PyTorch Version (e.g., 1.0): 1.8.1
OS (e.g., Linux): MacOS 11.2.3
How you installed fairseq (pip, source): CFLAGS="-stdlib=libc++" pip install --editable ./
Build command you used (if compiling from source):
Python version: 3.8.8
CUDA/cuDNN version: -
GPU models and configuration: -
Any other relevant information:

Additional context

Top GitHub Comments

jm-glowienke commented, Dec 4, 2022

Hi, I found a solution for the problems described in the issue. The fixes can be found on my personal fork of fairseq: https://github.com/jm-glowienke/fairseq Unfortunately, I cannot help you any further, as I only worked on this for my thesis almost 2 years ago.

xihajun commented, Dec 4, 2022

Hi @jm-glowienke, I would also like to know if there is any solution to this issue.

Are you also applying this to a transformer model?

This forum thread explains a bit about why their -replace-unk option does not work for the transformer model: https://forum.opennmt.net/t/translate-py-with-replace-unk-option-and-the-transformer-model/2646

It might be helpful.

[Update on Dec 04, 2022] My task was spelling correction, and I was trying to map all the special characters to <unk>. I used an alternative way to achieve that:

1. Replace all the special characters (e.g., 0-9) with <unk> in the paired data (this may also work for names and other words).
2. Train the model.
3. Replace them back in order.
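
The masking and restoring steps of this workaround could look roughly like the sketch below. The helper names `mask_special` and `restore_special` are hypothetical, the set of "special characters" is assumed to be the digits 0-9, and the training step in between is left out.

```python
import re
from typing import List, Tuple

UNK = "<unk>"
# Assumption: "special characters" means the digits 0-9.
SPECIAL = re.compile(r"[0-9]")


def mask_special(text: str) -> Tuple[str, List[str]]:
    """Replace every special character with <unk> and remember the originals."""
    originals = SPECIAL.findall(text)
    return SPECIAL.sub(UNK, text), originals


def restore_special(text: str, originals: List[str]) -> str:
    """Put the remembered characters back, one per <unk>, in order."""
    remaining = iter(originals)
    # If the model produced more <unk> than were masked, the extras stay as-is.
    return re.sub(re.escape(UNK), lambda m: next(remaining, m.group(0)), text)


masked, originals = mask_special("room 42b")
restored = restore_special(masked, originals)
print(masked, originals, restored)
```

Restoring in order only gives the right result if the model output keeps the <unk> tokens in the same order as the input, which seems plausible for spelling correction but is not guaranteed in general.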