Final vocabulary size is not equal to character vocabulary plus num_operations?
For this fake corpus:
when engage what
Its character vocabulary size is 7 (e a h w n g t).
Learn BPE with num_operations = 2, and apply it with the two generated codes (wh and en); we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@), not 9.
Am I calculating it wrong?
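For reference, a quick sanity check of that count is just to split the encoded line on whitespace and collect the unique tokens (plain Python, not anything subword-nmt itself does):

```python
# Count the unique tokens in the encoded toy corpus from above.
encoded = "wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t"
vocab = set(encoded.split())
print(len(vocab), sorted(vocab))
# -> 7 ['a@@', 'e', 'en', 'en@@', 'g@@', 't', 'wh@@']
```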
In my opinion, the equation final vocabulary size = character vocabulary + num_operations is based on the assumption that every merge operation generates exactly one new token.
But in this case, the merge operation of e and n generates two tokens, en and en@@, in the encoded text, and this is impossible to predict in advance. To make sure there is no unknown word, shouldn't the final vocabulary size be 18?
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@)
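Purely to illustrate where that 18 comes from (this is not how subword-nmt builds a vocabulary, just my reasoning spelled out): take every character and every merged subword in both its word-internal form (with @@) and its word-final form.

```python
# Hypothetical "safe" vocabulary: every unit in both possible forms.
chars = ["e", "a", "h", "w", "n", "g", "t"]   # character vocabulary (7)
merges = ["wh", "en"]                          # learned merge operations (2)
safe_vocab = set()
for unit in chars + merges:
    safe_vocab.add(unit)          # word-final / whole-word form
    safe_vocab.add(unit + "@@")   # word-internal form
print(len(safe_vocab))  # 18 = 2 * (7 + 2)
```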
I am really confused!
How should I generate the final vocabulary, and how can I control its size exactly?
Top GitHub Comments
Hi Hoang,
the number of unique tokens is typically larger than the number of merge operations because you have to add the size of the character vocabulary (*2 to account for the fact that characters could be word-internal or word-final). This also explains some of the unique tokens you see, such as “फ@@”.
The numbers won’t match perfectly, because some characters may only occur word-internally for example, or because all occurrences of a character or subword may have been merged into larger subwords.
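(Worked out on the toy corpus above: 7 characters * 2 forms = 14, plus 2 merge operations gives an upper bound of 16, or 18 if the merged subwords are also counted in both forms. Only 7 unique tokens actually occur there, because for example h, w and n only ever survive inside wh@@, en or en@@, and wh never occurs word-finally.)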
As for your question about why your encoded corpus contains larger subwords that are not in the list of merge operations: this shouldn't happen. How did you search for these mismatches?
Hi Rico,
Thanks for your detailed reply. In the beginning I naively thought we should use the list of merge operations as the vocab. Now, learning from this thread, I know a better way of using subword units is to extract the vocab (I suppose the most frequent tokens) from the encoded corpus. I actually ran an experiment myself to verify this and observed a better BLEU score this way (please let me know if you have a different experience on this). For me it is interesting to know these details under the hood. Thanks.
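To make that concrete, here is roughly what the extraction step looks like; `subword-nmt get-vocab` does essentially this frequency count, and the file names below are just placeholders:

```python
from collections import Counter

# Count token frequencies in the BPE-encoded corpus and write them out,
# most frequent first ("encoded.txt" stands in for the output of apply-bpe).
counts = Counter()
with open("encoded.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

with open("vocab.txt", "w", encoding="utf-8") as out:
    for token, freq in counts.most_common():
        out.write(f"{token} {freq}\n")
```

As far as I understand, this vocabulary file can then be passed back to apply-bpe via --vocabulary and --vocabulary-threshold so that rare subwords are split further, which gives more direct control over the final vocabulary size.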
BTW, I found these mismatches in a preprocessed training corpus uploaded by a team. I still have not figured out exactly why it happens, but since you think it should not happen, I suppose the strange behaviour does not come from subword-nmt. Thanks.