How do I share the vocab between the source and target language for machine translation?

Hi, how do I create a combined vocabulary from the source and target fields of the Multi30k dataset? I am interested in having a shared encoder that can represent both source and target words.

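# Imports for the legacy torchtext API (later releases moved Field to torchtext.legacy).
# tokenize_de and tokenize_en are assumed to be defined elsewhere.
from torchtext.data import Field
from torchtext.datasets import Multi30k

SRC = Field(tokenize=tokenize_de,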
            init_token='<sos>',
            eos_token='<eos>',
            lower=True,
            batch_first=True)

TRG = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True,
            batch_first=True)

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                    fields=(SRC, TRG))

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

4 reactions
bentrevett commented, Jan 17, 2020

@thak123

SRC.build_vocab(train_data.src, train_data.trg, min_freq=2)
TRG.vocab = SRC.vocab

This will cause your SRC and TRG to share a single vocab. You can then use one embedding layer for both languages instead of one per language.
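
For anyone wondering what the shared vocab amounts to, here is a rough, library-free sketch of the same idea. It is not torchtext's implementation: the helper names `build_shared_vocab` and `numericalize`, the special-token list, and the toy sentences are all made up for illustration. It just counts tokens over both languages together, keeps those at or above `min_freq`, and maps both sides through one index table.

```python
from collections import Counter

# Hypothetical special tokens, chosen to mirror the <sos>/<eos> setup in the question.
SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]


def build_shared_vocab(src_sentences, trg_sentences, min_freq=2):
    """Count tokens over both languages and build a single index table."""
    counts = Counter()
    for sentence in list(src_sentences) + list(trg_sentences):
        counts.update(sentence)

    itos = list(SPECIALS)
    # Sort by descending frequency, then alphabetically, so the order is deterministic.
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in SPECIALS:
            itos.append(token)

    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to indices, falling back to <unk> for anything not in the vocab."""
    unk = stoi["<unk>"]
    return [stoi.get(token, unk) for token in tokens]


# Toy, already-tokenized data (illustrative only, not taken from Multi30k).
src = [["ein", "mann", "läuft"], ["ein", "hund", "läuft"]]
trg = [["a", "man", "runs"], ["a", "dog", "runs"]]

itos, stoi = build_shared_vocab(src, trg, min_freq=2)
src_ids = numericalize(src[0], stoi)
trg_ids = numericalize(trg[0], stoi)
```

Because both languages index into the same table, a single embedding matrix with `len(itos)` rows can serve the encoder and the decoder.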

1 reaction
thak123 commented, Jan 17, 2020

omg… @bentrevett thanks for the code.
