
NLP: Add WordPiece/SentencePiece tokenizer/detokenizer, training?


What: WordPiece is an unsupervised, multi-lingual text tokenizer. It is used in models such as BERT, though it can in principle be used for many NLP models. It produces a user-specified, fixed-size vocabulary for NLP tasks that mixes both word and sub-word tokens. The approach also allows encoding of (what would normally be) out-of-vocabulary words as a sequence of sub-word tokens (including single characters, if no larger tokens are appropriate).

WordPiece is the closed-source tokenizer used for training BERT; SentencePiece is the open-source version: https://github.com/google/sentencepiece

There are three aspects to consider here:

  1. Encoding: tokenization, assigning sub-word indices according to a given vocabulary
  2. Decoding: sub-word indices -> tokens -> text
  3. Generating new vocabulary from raw text

In the short term, we’ll need to implement (1) and (2) to train our own, or use imported, BERT models.

Encoding should be straightforward: https://github.com/google-research/bert/blob/master/tokenization.py#L300-L359
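As a rough sketch of what that linked algorithm does (greedy longest-match-first over a fixed vocabulary, with continuation pieces prefixed by "##"), here is a minimal Python version; the function name and toy vocabulary are invented for illustration and are not the eventual DL4J API:

    def wordpiece_encode(word, vocab, unk="[UNK]", max_chars=200):
        """Split one (already whitespace-separated) word into sub-word pieces."""
        if len(word) > max_chars:
            return [unk]
        pieces = []
        start = 0
        while start < len(word):
            # Try the longest possible substring first, then shrink from the right.
            end = len(word)
            cur = None
            while start < end:
                sub = word[start:end]
                if start > 0:
                    sub = "##" + sub      # continuation pieces carry the "##" prefix
                if sub in vocab:
                    cur = sub
                    break
                end -= 1
            if cur is None:               # no piece matches -> whole word is unknown
                return [unk]
            pieces.append(cur)
            start = end
        return pieces

    # Example with a toy vocabulary:
    vocab = {"un", "aff", "##aff", "##able"}
    print(wordpiece_encode("unaffable", vocab))   # ['un', '##aff', '##able']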

Decoding is trivial: indices -> list of tokens -> concatenate -> replace underscores

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities. detokenized = ''.join(pieces).replace('▁', ' ')
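A minimal decoding sketch of both conventions, assuming id_to_piece is simply the vocabulary file read into a list (names are illustrative): the SentencePiece case is the one-liner quoted above, while BERT-style WordPiece strips the "##" continuation markers instead.

    def decode_sentencepiece(ids, id_to_piece):
        # SentencePiece marks word starts with '▁' (U+2581), so spaces are unambiguous.
        pieces = [id_to_piece[i] for i in ids]
        return "".join(pieces).replace("\u2581", " ").strip()

    def decode_bert_wordpiece(ids, id_to_piece):
        # BERT WordPiece marks continuations with '##' instead of marking word starts.
        pieces = [id_to_piece[i] for i in ids]
        text = ""
        for p in pieces:
            if p.startswith("##"):
                text += p[2:]        # continuation: glue to the previous piece
            else:
                text += " " + p      # new word: insert a space
        return text.strip()

Note that the BERT-style version inserts a space before every non-continuation piece, so the original spacing around punctuation is not exactly recovered.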

Also, wrapping this in a DataSetIterator for DL4J and SameDiff training would be useful (fine-tuning BERT models, etc.).

Generating new vocabularies (aspect 3) is a little more difficult. However, in many cases this will not be necessary: we can use existing vocabularies, such as those released with the pretrained BERT models. These vocabularies are not BERT-specific - they are generated in an unsupervised way.

If/when we do need to generate new vocabularies (such as a single-language vocab), we can do the following:

  1. Start with the existing vocabularies, and prune all non-required language tokens, OR
  2. Wrap the SentencePiece C++ code (JavaCPP bindings, presumably): https://github.com/google/sentencepiece/tree/master/src

There’s nothing stopping us from implementing the algorithms ourselves, but using the existing implementation (which is Apache 2.0) would be the more sensible first (and perhaps only) step.
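For reference, here is roughly what vocabulary generation looks like through SentencePiece's own Python bindings; a JavaCPP wrapper would need to expose essentially the same trainer and processor calls. The file names and option values below are illustrative only.

    import sentencepiece as spm

    # Train a new vocabulary from a raw text corpus (one sentence per line).
    spm.SentencePieceTrainer.Train(
        "--input=corpus.txt --model_prefix=mymodel --vocab_size=8000 --model_type=unigram"
    )

    # Load the trained model and use it for encoding/decoding.
    sp = spm.SentencePieceProcessor()
    sp.Load("mymodel.model")
    pieces = sp.EncodeAsPieces("New York is a city")   # e.g. ['▁New', '▁York', '▁is', ...]
    ids = sp.EncodeAsIds("New York is a city")
    text = sp.DecodeIds(ids)                           # round-trips back to the input text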

For more details:

  • Discussion + use of WordPiece: https://arxiv.org/abs/1609.08144
  • Paper introducing the approach + training algorithm: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
  • BERT model: https://arxiv.org/abs/1810.04805

Aha! Link: https://skymindai.aha.io/features/DL4J-14

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
treo commented, Feb 11, 2019

I’ve added the BERT WordPiece tokenizer in this PR:

https://github.com/deeplearning4j/deeplearning4j/pull/7141

> Ok, I’m trying to see the actual impact here, and whether that matters in practice…

The difference in practice is probably that we can’t use the simple handling of punctuation and whitespace that is possible in the BERT tokenizer. But if we add SentencePiece via JavaCPP we probably don’t really care, as they do all of that themselves anyway.

0 reactions
AlexDBlack commented, Feb 11, 2019

To make distinguishing them a bit easier I’ll call https://github.com/google-research/bert/blob/master/tokenization.py#L300-L359 “Bert WordPiece”.

👍

What I meant is that SentencePiece doesn’t pre-tokenize on white space, while Bert WordPiece first splits on white space (see https://github.com/google-research/bert/blob/master/tokenization.py#L152-L158).

Ok, I’m trying to see the actual impact here, and whether that matters in practice… the only practical difference I can come up with is that tokens like [_New_York] could in principle be present in SentencePiece but not WordPiece. Otherwise they should be basically the same. The decoding algorithm should be identical, I think. Encoding differs only in that you’re encoding the full string in one go, rather than encoding each word separately (i.e., the greedy encoding algorithm is otherwise the same).
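A toy illustration of that difference (the greedy matcher and vocabulary below are invented for the example, not taken from either implementation):

    def greedy(seq, vocab):
        """Greedy longest-match-first segmentation of one string."""
        out, start = [], 0
        while start < len(seq):
            end = len(seq)
            while end > start and seq[start:end] not in vocab:
                end -= 1
            if end == start:          # nothing matched: fall back to a single character
                end = start + 1
            out.append(seq[start:end])
            start = end
        return out

    vocab = {"\u2581New\u2581York", "\u2581New", "\u2581York", "New", "York"}

    # SentencePiece-style: normalize spaces to '▁' and encode the whole string at once,
    # so a piece spanning the word boundary ('▁New▁York') is possible.
    print(greedy("\u2581New\u2581York", vocab))              # ['▁New▁York']

    # Bert-WordPiece-style: pre-split on whitespace and encode each word separately,
    # so no single piece can ever cover both words.
    print([greedy(w, vocab) for w in "New York".split()])    # [['New'], ['York']]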

> I would guess that the Start of the Sentence is handled as if it had a space at the beginning

That makes more sense.

