Joined dictionary for more than 2 languages
Hello, in the multilingual translation example, a joined dictionary is created for de-en, and the resulting dictionary is then reused for fr-en. In that case I think it's fine, because there is probably a lot of vocabulary overlap among these languages. However, what if I have three really different languages with little overlap, for instance English-Korean-Chinese, each with its own writing system? If I create a joined dictionary for English-Korean first, many Chinese subwords may be missing from the final dictionary.
One workaround I used is to combine the training data from all languages, then call `fairseq-preprocess` once to generate a joined dictionary. After that, I run `fairseq-preprocess` separately on each language pair, reusing the joined dictionary from the first step.
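A hedged sketch of that workaround as shell commands. The file names, language list, and directory layout are hypothetical; `--only-source`, `--trainpref`, `--srcdict`, `--tgtdict`, and `--destdir` are real `fairseq-preprocess` options, but the exact invocation for your data may differ:

```shell
# 1) Build one joined dictionary over all three languages by treating the
#    concatenated corpora as a single monolingual corpus (--only-source
#    builds a dictionary without needing a target side).
cat train.en train.ko train.zh > combined.all
fairseq-preprocess --only-source --source-lang all \
    --trainpref combined --destdir dict-only
# -> dict-only/dict.all.txt now covers en+ko+zh subwords

# 2) Binarize each pair separately, reusing that dictionary on both sides
#    so every pair shares one consistent vocabulary.
for pair in en-ko en-zh ko-zh; do
  src=${pair%-*}; tgt=${pair#*-}
  fairseq-preprocess --source-lang "$src" --target-lang "$tgt" \
      --trainpref "train.$pair" --validpref "valid.$pair" \
      --srcdict dict-only/dict.all.txt --tgtdict dict-only/dict.all.txt \
      --destdir "data-bin/$pair"
done
```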
Issue Analytics
- Created 4 years ago
- Reactions: 6
- Comments: 9 (2 by maintainers)
Top GitHub Comments
To be more specific: if you use preprocessing similar to `prepare-iwslt17-multilingual.sh` for English-Korean-Chinese, the SentencePiece package will generate a vocabulary file [1]. We can convert it to fairseq's vocabulary file format and pass it as a predefined vocabulary when running `fairseq-preprocess`, e.g. by adding a conversion step like this to `prepare-iwslt17-multilingual.sh`; then, when binarizing, add the `--tgtdict fairseq.vocab` option to assign the predefined vocabulary file.

[1] https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt17-multilingual.sh#L105
@okgrammer Hi! Did you ever find out the purpose of that loop, `for LANG in "$SRC" "$TGT"; do`? Thanks!