Joined dictionary for more than 2 languages

Hello, in the multilingual translation example, a joined dictionary is created for de-en, and the resulting dictionary is then reused for fr-en. In this case I think it’s fine because the vocabularies of these languages probably overlap a lot. However, what if I have 3 really different languages with little overlap, for instance English-Korean-Chinese, each with its own writing system? For example, if I create a joined dictionary for English-Korean first, then a lot of Chinese subwords may be missing from the final dictionary.

One workaround I used is to combine the training data from all languages, then call fairseq-preprocess once to generate a joined dictionary. After that, I run fairseq-preprocess separately on each language pair, reusing the joined dictionary from the first step.
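The workaround described above could look roughly like the sketch below. It is not taken from the issue: the file names, the toy data, and the pseudo language codes `src` and `tgt` are assumptions made for illustration, and the fairseq steps only run if `fairseq-preprocess` is installed. The options used (`--joined-dictionary`, `--srcdict`, `--tgtdict`, `--trainpref`, `--destdir`) are existing fairseq-preprocess options.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for BPE-encoded training data (hypothetical file names).
TEXT=$(mktemp -d)
printf '%s\n' '▁hello ▁world'  > "$TEXT/train.en-ko.en"
printf '%s\n' '▁안녕 ▁세계'     > "$TEXT/train.en-ko.ko"
printf '%s\n' '▁good ▁morning' > "$TEXT/train.en-zh.en"
printf '%s\n' '▁早 上 好'       > "$TEXT/train.en-zh.zh"

# Step 1: pool the text of every language into one pseudo-parallel corpus.
# "src" and "tgt" are just containers; with a joined dictionary only the
# union of tokens across both files matters.
cat "$TEXT/train.en-ko.en" "$TEXT/train.en-zh.en" > "$TEXT/all.src"
cat "$TEXT/train.en-ko.ko" "$TEXT/train.en-zh.zh" > "$TEXT/all.tgt"

if command -v fairseq-preprocess >/dev/null 2>&1; then
  # Step 2: build one joined dictionary from the pooled data.
  fairseq-preprocess --source-lang src --target-lang tgt \
    --trainpref "$TEXT/all" --joined-dictionary \
    --destdir "$TEXT/joined"
  DICT="$TEXT/joined/dict.src.txt"

  # Step 3: binarize each language pair, reusing the joined dictionary
  # on both the source and the target side.
  for PAIR in en-ko en-zh; do
    SRC=${PAIR%-*}
    TGT=${PAIR#*-}
    fairseq-preprocess --source-lang "$SRC" --target-lang "$TGT" \
      --trainpref "$TEXT/train.$PAIR" \
      --srcdict "$DICT" --tgtdict "$DICT" \
      --destdir "$TEXT/data-bin"
  done
else
  echo "fairseq-preprocess not found; skipping dictionary and binarization steps" >&2
fi
```

In step 3 the dictionary is passed through `--srcdict` and `--tgtdict` rather than `--joined-dictionary`, since the dictionary already exists and only needs to be reused.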

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 6
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

3 reactions
pipibjc commented, Jul 7, 2019

To be more specific: if you use preprocessing similar to prepare-iwslt17-multilingual.sh for English-Korean-Chinese, the SentencePiece package will generate a vocabulary file [1]. You can convert it to the fairseq vocabulary file format and pass it as a predefined vocabulary when running fairseq-preprocess.

For example, add something like this to prepare-iwslt17-multilingual.sh:

# strip the first three special tokens and append a placeholder count to each vocabulary entry
tail -n +4 "$DATA/sentencepiece.bpe.vocab" | cut -f1 | sed 's/$/ 100/' > fairseq.vocab

Then, when binarizing, add the --tgtdict fairseq.vocab option to supply the predefined vocabulary file.
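To make the format conversion concrete, here is a small sketch that runs the same pipeline on a toy vocabulary. The pieces and scores are made up for illustration; the layout (one `piece<TAB>score` row per line, with `<unk>`, `<s>` and `</s>` as the first three rows) is assumed to match what SentencePiece writes with its default settings.

```shell
#!/usr/bin/env bash
set -euo pipefail

WORK=$(mktemp -d)

# Toy SentencePiece vocabulary: three special tokens, then three pieces.
printf '<unk>\t0\n<s>\t0\n</s>\t0\n▁the\t-3.1\n▁안녕\t-5.2\n好\t-6.0\n' \
  > "$WORK/sentencepiece.bpe.vocab"

# Drop the three special tokens, keep the piece column, and append a
# placeholder count so each line reads "<piece> <count>".
tail -n +4 "$WORK/sentencepiece.bpe.vocab" | cut -f1 | sed 's/$/ 100/' \
  > "$WORK/fairseq.vocab"

cat "$WORK/fairseq.vocab"
# ▁the 100
# ▁안녕 100
# 好 100
```

The count of 100 is only a placeholder: it gives every entry the same frequency, so it carries no real frequency information.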

[1] https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt17-multilingual.sh#L105

2 reactions
feralvam commented, Nov 15, 2019

@okgrammer Hi! Did you ever find out the purpose of that loop for LANG in "$SRC" "$TGT"; do? Thanks!
