Joined dictionary for more than 2 languages
Hello, in the multilingual translation example, a joined dictionary is created for de-en, and the resulting dictionary is then reused for fr-en. In that case I think it's fine, because there is probably a lot of vocabulary overlap among these languages. However, what if I have three really different languages with little overlap, for instance English-Korean-Chinese, each with its own writing system? If I create a joined dictionary for English-Korean first, many Chinese subwords may be missing from the final dictionary.
One workaround I used is to combine the training data from all languages, then call `fairseq-preprocess` once to generate a joined dictionary. After that, I run `fairseq-preprocess` separately on each language pair, reusing the joined dictionary from the first step.
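A hedged sketch of that workaround as shell commands. The file names, language list, and directory layout are hypothetical; `--only-source`, `--trainpref`, `--srcdict`, `--tgtdict`, and `--destdir` are real `fairseq-preprocess` options, but the exact invocation for your data may differ:

```shell
# 1) Build one joined dictionary over all three languages by treating the
#    concatenated corpora as a single monolingual corpus (--only-source
#    builds a dictionary without needing a target side).
cat train.en train.ko train.zh > combined.all
fairseq-preprocess --only-source --source-lang all \
    --trainpref combined --destdir dict-only
# -> dict-only/dict.all.txt now covers en+ko+zh subwords

# 2) Binarize each pair separately, reusing that dictionary on both sides
#    so every pair shares one consistent vocabulary.
for pair in en-ko en-zh ko-zh; do
  src=${pair%-*}; tgt=${pair#*-}
  fairseq-preprocess --source-lang "$src" --target-lang "$tgt" \
      --trainpref "train.$pair" --validpref "valid.$pair" \
      --srcdict dict-only/dict.all.txt --tgtdict dict-only/dict.all.txt \
      --destdir "data-bin/$pair"
done
```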
Issue Analytics
- Created 4 years ago
- Reactions: 6
- Comments: 9 (2 by maintainers)
Top GitHub Comments
To be more specific: if you use preprocessing similar to `prepare-iwslt17-multilingual.sh` for English-Korean-Chinese, the SentencePiece package will generate a vocabulary file [1]. We can convert it to fairseq's vocabulary file format and pass it as a predefined vocabulary when running `fairseq-preprocess`, e.g. by adding a conversion step like this to `prepare-iwslt17-multilingual.sh`; then, when binarizing, add the `--tgtdict fairseq.vocab` option to assign the predefined vocabulary file.

[1] https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt17-multilingual.sh#L105
@okgrammer Hi! Did you ever find out the purpose of that loop, `for LANG in "$SRC" "$TGT"; do`? Thanks!