Support Removing BPE when encoding with Sentencepiece
I'm using SentencePiece instead of subword-nmt to tokenize the data.
Problem: During evaluation the flag --remove-bpe is useless, because with SentencePiece:
1. the bpe-token changes from @@ to ▁.
2. first the whitespace needs to be removed, and second the bpe-token needs to be replaced with whitespace.
Currently you can fulfill (1) by passing "▁" as an argument for the --remove-bpe flag, but that does not eliminate the additional whitespace from (2).
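
The two steps in (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: the function names `remove_marker_only` and `remove_sentencepiece` are hypothetical, and the input is assumed to be a space-separated string of SentencePiece pieces that uses ▁ (U+2581) as the word-boundary marker.

```python
SPM_MARKER = "\u2581"  # the "▁" character SentencePiece uses for word boundaries


def remove_marker_only(line: str, marker: str = SPM_MARKER) -> str:
    """Hypothetical stand-in for passing "▁" as the symbol to strip.

    Only the marker is deleted, so the spaces between pieces of the
    same word are left behind.
    """
    return line.replace(marker, "")


def remove_sentencepiece(line: str, marker: str = SPM_MARKER) -> str:
    """Sketch of the two steps described above.

    First drop the whitespace that separates pieces, then turn each
    marker back into a real space and trim the leading one.
    """
    return line.replace(" ", "").replace(marker, " ").strip()


pieces = "\u2581Hello \u2581wor ld \u2581!"
print(remove_marker_only(pieces))    # Hello wor ld !
print(remove_sentencepiece(pieces))  # Hello world !
```

The first function reproduces the leftover whitespace inside "wor ld"; the second joins the pieces before restoring word boundaries.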
Issue Analytics
- State:
- Created 5 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
Just following up on this, you can now detokenize sentencepiece with --remove-bpe=sentencepiece

Seeing the same problem here in the Romanian example: if you use --remove-bpe=sentencepiece in fairseq-generate, it removes all of the spaces from the S, T and D lines.
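
One way a symptom like this could arise is sketched below, purely as an assumption: the thread does not establish the actual cause in the Romanian example. If a "remove whitespace, then turn ▁ into a space" rule is applied to a line that contains no ▁ markers, nothing is left to restore the word boundaries. The function name `strip_spm` is hypothetical.

```python
def strip_spm(line: str, marker: str = "\u2581") -> str:
    # Hypothetical sketch of a "sentencepiece" post-processing rule:
    # join the pieces, then map each marker back to a space.
    return line.replace(" ", "").replace(marker, " ").strip()


with_markers = "\u2581Hello \u2581wor ld"
without_markers = "Hello world"  # assumption: text that carries no "▁" marker

print(strip_spm(with_markers))     # Hello world
print(strip_spm(without_markers))  # Helloworld
```

Checking whether the S, T and D lines actually contain ▁ before post-processing would confirm or rule out this explanation.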