Final vocabulary size is not equal to character vocabulary plus num_operations?
For this fake corpus:
when engage what
Its character vocabulary size is 7 (e a h w n g t).
Learn BPE with num_operations = 2, and apply it with the two generated codes (wh and en); we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@), not 9.
Am I calculating it wrong?
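For reference, a quick sanity check of that count is just to split the encoded line on whitespace and collect the unique tokens (plain Python, not anything subword-nmt itself does):

```python
# Count the unique tokens in the encoded toy corpus from above.
encoded = "wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t"
vocab = set(encoded.split())
print(len(vocab), sorted(vocab))
# -> 7 ['a@@', 'e', 'en', 'en@@', 'g@@', 't', 'wh@@']
```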
In my opinion, the equation final vocabulary size = character vocabulary + num_operations is based on the assumption that every merge operation generates exactly one new token.
But in this case, the merge operation of e and n generates two tokens, en and en@@, in the encoded text, and this is impossible to predict in advance. To make sure there is no unknown word, shouldn't the final vocabulary size be 18?
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@)
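Purely to illustrate where that 18 comes from (this is not how subword-nmt builds a vocabulary, just my reasoning spelled out): take every character and every merged subword in both its word-internal form (with @@) and its word-final form.

```python
# Hypothetical "safe" vocabulary: every unit in both possible forms.
chars = ["e", "a", "h", "w", "n", "g", "t"]   # character vocabulary (7)
merges = ["wh", "en"]                          # learned merge operations (2)
safe_vocab = set()
for unit in chars + merges:
    safe_vocab.add(unit)          # word-final / whole-word form
    safe_vocab.add(unit + "@@")   # word-internal form
print(len(safe_vocab))  # 18 = 2 * (7 + 2)
```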
I am really confused!
How should I generate the final vocabulary, and how can I control its size exactly?
Top GitHub Comments
Hi Hoang,
the number of unique tokens is typically larger than the number of merge operations because you have to add the size of the character vocabulary (*2 to account for the fact that characters could be word-internal or word-final). This also explains some of the unique tokens you see, such as “फ@@”.
The numbers won’t match perfectly, because some characters may only occur word-internally for example, or because all occurrences of a character or subword may have been merged into larger subwords.
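(Worked out on the toy corpus above: 7 characters * 2 forms = 14, plus 2 merge operations gives an upper bound of 16, or 18 if the merged subwords are also counted in both forms. Only 7 unique tokens actually occur there, because for example h, w and n only ever survive inside wh@@, en or en@@, and wh never occurs word-finally.)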
As for your question about why your encoded corpus contains larger subwords that are not in the list of merge operations: this shouldn't happen. How did you search for these mismatches?
Hi Rico,
Thanks for your detailed reply. In the beginning I naively thought we should use the list of merge operations as the vocab. Now, learning from this thread, I know a better way of using subword units is to extract the vocab (I suppose the most frequent tokens) from the encoded corpus. I actually ran an experiment myself to verify this and observed a better BLEU score this way (please let me know if you have a different experience on this). For me it is interesting to know these details under the hood. Thanks.
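To make that concrete, here is roughly what the extraction step looks like; `subword-nmt get-vocab` does essentially this frequency count, and the file names below are just placeholders:

```python
from collections import Counter

# Count token frequencies in the BPE-encoded corpus and write them out,
# most frequent first ("encoded.txt" stands in for the output of apply-bpe).
counts = Counter()
with open("encoded.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

with open("vocab.txt", "w", encoding="utf-8") as out:
    for token, freq in counts.most_common():
        out.write(f"{token} {freq}\n")
```

As far as I understand, this vocabulary file can then be passed back to apply-bpe via --vocabulary and --vocabulary-threshold so that rare subwords are split further, which gives more direct control over the final vocabulary size.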
BTW, I found these mismatches in a preprocessed training corpus uploaded by a team. I still have not figured out exactly why it happens, but since you think it should not happen, I suppose the strange behaviour does not come from subword-nmt. Thanks.