NLP: Add WordPiece/SentencePiece tokenizer/detokenizer, training?
What: WordPiece is an unsupervised, multi-lingual text tokenizer. It is used in models such as BERT, though it can in principle be used with many NLP models. It produces a user-specified, fixed-size vocabulary for NLP tasks that mixes both word and sub-word tokens. The approach also allows encoding of (what would normally be) out-of-vocabulary words as a sequence of sub-word tokens (including single characters, if no larger tokens are appropriate).
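For example (taken from the docstring of the BERT tokenizer linked below): given a vocabulary containing the pieces `un`, `##aff` and `##able`, the out-of-vocabulary word "unaffable" is encoded as `["un", "##aff", "##able"]`, where the `##` prefix marks a continuation piece.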
WordPiece is the closed-source version used for training BERT; SentencePiece is the open-source version: https://github.com/google/sentencepiece
There are three aspects to consider here:
1. Encoding: tokenization, assigning sub-word indices according to a given vocabulary
2. Decoding: sub-word indices -> tokens -> text
3. Generating a new vocabulary from raw text
In the short term, we’ll need to implement (1) and (2) to train our own, or use imported, BERT models.
Encoding should be straightforward: https://github.com/google-research/bert/blob/master/tokenization.py#L300-L359
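As a rough illustration, here is a minimal Java sketch of that greedy longest-match-first algorithm; the `WordPieceEncoder` class, its constructor arguments and the `encodeWord` method are illustrative names only, not an existing API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Minimal sketch of BERT-style greedy longest-match-first WordPiece encoding. */
public class WordPieceEncoder {

    private final Set<String> vocab;   // token strings, e.g. loaded from a vocab.txt
    private final String unkToken;     // typically "[UNK]"
    private final int maxCharsPerWord; // words longer than this are mapped straight to unkToken

    public WordPieceEncoder(Set<String> vocab, String unkToken, int maxCharsPerWord) {
        this.vocab = vocab;
        this.unkToken = unkToken;
        this.maxCharsPerWord = maxCharsPerWord;
    }

    /** Splits a single (already whitespace/punctuation-separated) word into sub-word tokens. */
    public List<String> encodeWord(String word) {
        List<String> out = new ArrayList<>();
        if (word.length() > maxCharsPerWord) {
            out.add(unkToken);
            return out;
        }
        int start = 0;
        List<String> pieces = new ArrayList<>();
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Greedily take the longest vocabulary entry matching at 'start';
            // continuation pieces are prefixed with "##", as in BERT's vocab files.
            while (start < end) {
                String sub = word.substring(start, end);
                if (start > 0) {
                    sub = "##" + sub;
                }
                if (vocab.contains(sub)) {
                    match = sub;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matched at this position -> the whole word becomes [UNK]
                out.add(unkToken);
                return out;
            }
            pieces.add(match);
            start = end;
        }
        out.addAll(pieces);
        return out;
    }
}
```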
Decoding is trivial: indices -> list of tokens -> concatenate -> replace the '▁' whitespace markers with spaces.
Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.
detokenized = ''.join(pieces).replace('▁', ' ')
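For reference, a minimal Java equivalent of that one-liner, assuming SentencePiece-style pieces where the '▁' (U+2581) meta symbol marks word boundaries (for BERT-style WordPiece output one would instead strip the "##" continuation prefixes and re-insert spaces between words):

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of SentencePiece-style detokenization: concatenate the pieces and
 *  turn the '\u2581' ("▁") whitespace meta symbol back into spaces. */
public class PieceDecoder {

    public static String detokenize(List<String> pieces) {
        StringBuilder sb = new StringBuilder();
        for (String p : pieces) {
            sb.append(p);
        }
        // '\u2581' (LOWER ONE EIGHTH BLOCK) is the marker SentencePiece uses for
        // word boundaries; replace it with a plain space and trim the leading one.
        return sb.toString().replace('\u2581', ' ').trim();
    }

    public static void main(String[] args) {
        // e.g. ["▁New", "▁York"] -> "New York"
        System.out.println(detokenize(Arrays.asList("\u2581New", "\u2581York")));
    }
}
```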
Also, wrapping this in a DataSetIterator for DL4J and SameDiff training would be useful (e.g., for fine-tuning BERT models).
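To make the iterator idea concrete, here is a hypothetical sketch of the batching step such an iterator would perform: mapping a mini-batch of tokenized sentences to padded index and mask matrices. The `BertBatchBuilder` name and its fields are invented for illustration; a real implementation would implement DL4J's MultiDataSetIterator and wrap these arrays in INDArrays.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the core of a BERT-style iterator: turn a mini-batch of
 *  already-tokenized sentences into padded token-index and mask matrices. */
public class BertBatchBuilder {

    private final Map<String, Integer> vocabIndex; // token -> vocabulary index
    private final int unkIndex;                    // index of the "[UNK]" token
    private final int padIndex;                    // index of the "[PAD]" token

    public BertBatchBuilder(Map<String, Integer> vocabIndex, int unkIndex, int padIndex) {
        this.vocabIndex = vocabIndex;
        this.unkIndex = unkIndex;
        this.padIndex = padIndex;
    }

    /** Returns {indices, mask}, each of shape [batch][maxLen]. */
    public int[][][] build(List<List<String>> tokenizedBatch, int maxLen) {
        int batch = tokenizedBatch.size();
        int[][] indices = new int[batch][maxLen];
        int[][] mask = new int[batch][maxLen];
        for (int i = 0; i < batch; i++) {
            List<String> tokens = tokenizedBatch.get(i);
            for (int j = 0; j < maxLen; j++) {
                if (j < tokens.size()) {
                    // look the token up, falling back to [UNK] if it is not in the vocab
                    indices[i][j] = vocabIndex.getOrDefault(tokens.get(j), unkIndex);
                    mask[i][j] = 1;
                } else {
                    indices[i][j] = padIndex;  // pad to maxLen
                    mask[i][j] = 0;
                }
            }
        }
        return new int[][][]{indices, mask};
    }
}
```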
Generating new vocabularies is a little more difficult. However, in many cases this will not be necessary: we can use existing vocabularies, such as those released with the pretrained BERT models. These vocabularies are not BERT-specific - they are generated in an unsupervised way.
If/when we do need to implement new vocabularies (such as a single language vocab) we can do the following:
- Start with the existing vocabularies and prune all non-required language tokens (see the pruning sketch below), OR
- Wrap the SentencePiece C++ code (JavaCPP bindings, presumably) - https://github.com/google/sentencepiece/tree/master/src
There’s nothing stopping us from implementing the algorithms ourselves, but using the existing implementation (which is Apache 2.0) would be the more sensible first (and perhaps only) step.
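For the first option, pruning could be as simple as filtering the released vocab.txt (one token per line, as in the pretrained BERT vocabularies) down to the tokens we care about plus the special tokens. The sketch below keeps only Latin-script tokens; the file names and the exact filter are illustrative assumptions, not a proposed API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch: prune a multilingual BERT vocab.txt down to Latin-script
 *  tokens, digits, ASCII punctuation and the special tokens. */
public class VocabPruner {

    private static final Set<String> SPECIAL =
            Set.of("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]");

    public static void main(String[] args) throws IOException {
        Path in = Paths.get("multilingual-vocab.txt");   // assumed input file, one token per line
        Path out = Paths.get("pruned-vocab.txt");

        List<String> kept = new ArrayList<>();
        for (String token : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            if (SPECIAL.contains(token) || isLatinOnly(token)) {
                kept.add(token);
            }
        }
        Files.write(out, kept, StandardCharsets.UTF_8);
    }

    /** Keep tokens whose characters (ignoring a leading "##") are Latin letters,
     *  digits or other ASCII characters. */
    private static boolean isLatinOnly(String token) {
        String body = token.startsWith("##") ? token.substring(2) : token;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            boolean latinLetter = Character.UnicodeScript.of(c) == Character.UnicodeScript.LATIN;
            boolean asciiOther = c < 128 && !Character.isLetter(c);
            if (!latinLetter && !asciiOther) {
                return false;
            }
        }
        return !body.isEmpty();
    }
}
```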
For more details:
- Discussion + use of WordPiece: https://arxiv.org/abs/1609.08144
- Paper introducing the approach + training algorithm: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
- BERT model: https://arxiv.org/abs/1810.04805
Aha! Link: https://skymindai.aha.io/features/DL4J-14
I’ve added the BERT WordPiece tokenizer in this PR:
https://github.com/deeplearning4j/deeplearning4j/pull/7141
The difference in practice is probably that we can’t use the simple handling of punctuation and whitespace that is possible in the BERT tokenizer. But if we add SentencePiece via JavaCPP, we probably don’t really care, as it does all of that itself anyway.
👍
Ok, I’m trying to see the actual impact here, and whether that matters in practice… the only practical difference I can come up with is that tokens like [▁New▁York] could in principle be present in SentencePiece but not WordPiece. Otherwise they should be basically the same. The decoding algorithm should be identical, I think. Encoding differs only in that you’re encoding the full string in one go, rather than encoding each word separately (i.e., the greedy encoding algorithm is otherwise the same).
That makes more sense.