NLP: Add WordPiece/SentencePiece tokenizer/detokenizer, training?
What: WordPiece is an unsupervised, multi-lingual text tokenizer. It is used in models such as BERT, though it can in principle be used with many NLP models. It produces a user-specified, fixed-size vocabulary for NLP tasks that mixes both word and sub-word tokens. The approach also allows encoding of (what would normally be) out-of-vocabulary words as a sequence of sub-word tokens (including single characters, if no larger tokens are appropriate).
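For example (taken from the docstring of the BERT tokenizer linked below): given a vocabulary containing the pieces `un`, `##aff` and `##able`, the out-of-vocabulary word "unaffable" is encoded as `["un", "##aff", "##able"]`, where the `##` prefix marks a continuation piece.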
WordPiece is the closed-source version used for training BERT; SentencePiece is the open-source version: https://github.com/google/sentencepiece
There are three aspects to consider here:
1. Encoding: tokenization, assigning sub-word indices according to a given vocabulary
2. Decoding: sub-word indices -> tokens -> text
3. Generating a new vocabulary from raw text
In the short term, we’ll need to implement (1) and (2) to train our own, or use imported, BERT models.
Encoding should be straightforward: https://github.com/google-research/bert/blob/master/tokenization.py#L300-L359
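As a rough illustration, here is a minimal Java sketch of that greedy longest-match-first algorithm; the `WordPieceEncoder` class, its constructor arguments and the `encodeWord` method are illustrative names only, not an existing API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Minimal sketch of BERT-style greedy longest-match-first WordPiece encoding. */
public class WordPieceEncoder {

    private final Set<String> vocab;   // token strings, e.g. loaded from a vocab.txt
    private final String unkToken;     // typically "[UNK]"
    private final int maxCharsPerWord; // words longer than this are mapped straight to unkToken

    public WordPieceEncoder(Set<String> vocab, String unkToken, int maxCharsPerWord) {
        this.vocab = vocab;
        this.unkToken = unkToken;
        this.maxCharsPerWord = maxCharsPerWord;
    }

    /** Splits a single (already whitespace/punctuation-separated) word into sub-word tokens. */
    public List<String> encodeWord(String word) {
        List<String> out = new ArrayList<>();
        if (word.length() > maxCharsPerWord) {
            out.add(unkToken);
            return out;
        }
        int start = 0;
        List<String> pieces = new ArrayList<>();
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Greedily take the longest vocabulary entry matching at 'start';
            // continuation pieces are prefixed with "##", as in BERT's vocab files.
            while (start < end) {
                String sub = word.substring(start, end);
                if (start > 0) {
                    sub = "##" + sub;
                }
                if (vocab.contains(sub)) {
                    match = sub;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece matched at this position -> the whole word becomes [UNK]
                out.add(unkToken);
                return out;
            }
            pieces.add(match);
            start = end;
        }
        out.addAll(pieces);
        return out;
    }
}
```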
Decoding is trivial: indices -> list of tokens -> concatenate -> replace the '▁' whitespace markers with spaces.
Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.
detokenized = ''.join(pieces).replace('▁', ' ')
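For reference, a minimal Java equivalent of that one-liner, assuming SentencePiece-style pieces where the '▁' (U+2581) meta symbol marks word boundaries (for BERT-style WordPiece output one would instead strip the "##" continuation prefixes and re-insert spaces between words):

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of SentencePiece-style detokenization: concatenate the pieces and
 *  turn the '\u2581' ("▁") whitespace meta symbol back into spaces. */
public class PieceDecoder {

    public static String detokenize(List<String> pieces) {
        StringBuilder sb = new StringBuilder();
        for (String p : pieces) {
            sb.append(p);
        }
        // '\u2581' (LOWER ONE EIGHTH BLOCK) is the marker SentencePiece uses for
        // word boundaries; replace it with a plain space and trim the leading one.
        return sb.toString().replace('\u2581', ' ').trim();
    }

    public static void main(String[] args) {
        // e.g. ["▁New", "▁York"] -> "New York"
        System.out.println(detokenize(Arrays.asList("\u2581New", "\u2581York")));
    }
}
```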
Also, wrapping this in a DataSetIterator for DL4J and SameDiff training would be useful (e.g., for fine-tuning BERT models).
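To make the iterator idea concrete, here is a hypothetical sketch of the batching step such an iterator would perform: mapping a mini-batch of tokenized sentences to padded index and mask matrices. The `BertBatchBuilder` name and its fields are invented for illustration; a real implementation would implement DL4J's MultiDataSetIterator and wrap these arrays in INDArrays.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the core of a BERT-style iterator: turn a mini-batch of
 *  already-tokenized sentences into padded token-index and mask matrices. */
public class BertBatchBuilder {

    private final Map<String, Integer> vocabIndex; // token -> vocabulary index
    private final int unkIndex;                    // index of the "[UNK]" token
    private final int padIndex;                    // index of the "[PAD]" token

    public BertBatchBuilder(Map<String, Integer> vocabIndex, int unkIndex, int padIndex) {
        this.vocabIndex = vocabIndex;
        this.unkIndex = unkIndex;
        this.padIndex = padIndex;
    }

    /** Returns {indices, mask}, each of shape [batch][maxLen]. */
    public int[][][] build(List<List<String>> tokenizedBatch, int maxLen) {
        int batch = tokenizedBatch.size();
        int[][] indices = new int[batch][maxLen];
        int[][] mask = new int[batch][maxLen];
        for (int i = 0; i < batch; i++) {
            List<String> tokens = tokenizedBatch.get(i);
            for (int j = 0; j < maxLen; j++) {
                if (j < tokens.size()) {
                    // look the token up, falling back to [UNK] if it is not in the vocab
                    indices[i][j] = vocabIndex.getOrDefault(tokens.get(j), unkIndex);
                    mask[i][j] = 1;
                } else {
                    indices[i][j] = padIndex;  // pad to maxLen
                    mask[i][j] = 0;
                }
            }
        }
        return new int[][][]{indices, mask};
    }
}
```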
Generating new vocabularies is a little more difficult. However, in many cases this will not be necessary: we can use existing vocabularies, such as those released with the pretrained BERT models. These vocabularies are not BERT-specific - they are generated in an unsupervised way.
If/when we do need to implement new vocabularies (such as a single language vocab) we can do the following:
- Start with the existing vocabularies and prune all non-required language tokens (see the pruning sketch below), OR
- Wrap the SentencePiece C++ code (JavaCPP bindings, presumably) - https://github.com/google/sentencepiece/tree/master/src
There’s nothing stopping us from implementing the algorithms ourselves, but using the existing implementation (which is Apache 2.0) would be the more sensible first (and perhaps only) step.
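For the first option, pruning could be as simple as filtering the released vocab.txt (one token per line, as in the pretrained BERT vocabularies) down to the tokens we care about plus the special tokens. The sketch below keeps only Latin-script tokens; the file names and the exact filter are illustrative assumptions, not a proposed API:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch: prune a multilingual BERT vocab.txt down to Latin-script
 *  tokens, digits, ASCII punctuation and the special tokens. */
public class VocabPruner {

    private static final Set<String> SPECIAL =
            Set.of("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]");

    public static void main(String[] args) throws IOException {
        Path in = Paths.get("multilingual-vocab.txt");   // assumed input file, one token per line
        Path out = Paths.get("pruned-vocab.txt");

        List<String> kept = new ArrayList<>();
        for (String token : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            if (SPECIAL.contains(token) || isLatinOnly(token)) {
                kept.add(token);
            }
        }
        Files.write(out, kept, StandardCharsets.UTF_8);
    }

    /** Keep tokens whose characters (ignoring a leading "##") are Latin letters,
     *  digits or other ASCII characters. */
    private static boolean isLatinOnly(String token) {
        String body = token.startsWith("##") ? token.substring(2) : token;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            boolean latinLetter = Character.UnicodeScript.of(c) == Character.UnicodeScript.LATIN;
            boolean asciiOther = c < 128 && !Character.isLetter(c);
            if (!latinLetter && !asciiOther) {
                return false;
            }
        }
        return !body.isEmpty();
    }
}
```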
For more details:
- Discussion + use of WordPiece: https://arxiv.org/abs/1609.08144
- Paper introducing the approach + training algorithm: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
- BERT model: https://arxiv.org/abs/1810.04805
Aha! Link: https://skymindai.aha.io/features/DL4J-14
I’ve added the BERT WordPiece tokenizer in this PR:
https://github.com/deeplearning4j/deeplearning4j/pull/7141
The difference in practice is probably that we can’t use the simple handling of punctuation and whitespace that is possible in the BERT tokenizer. But if we add SentencePiece via JavaCPP, we probably don’t really care, as it does all of that itself anyway.
👍
Ok, I’m trying to see the actual impact here, and whether that matters in practice… the only practical difference I can come up with is that tokens like [▁New▁York] could in principle be present in SentencePiece but not WordPiece. Otherwise they should be basically the same. The decoding algorithm should be identical, I think. Encoding differs only in that you’re encoding the full string in one go, rather than encoding each word separately (i.e., the greedy encoding algorithm is otherwise the same).
That makes more sense.