Tokenizer inconsistencies
Hi! I’m getting a strange inconsistency when tokenizing contractions. Each string should be split into four tokens, with 'll as its own token, but only the first one is. This is a simplified example:
How to reproduce the behaviour
import spacy

nlp = spacy.load('en_core_web_sm')
examples = [
    "That'll do.",
    "This'll do.",
    "Those'll do.",
    "These'll do.",
]
for example in examples:
    print(list(nlp(example)))
Prints out:
[That, 'll, do, .]
[This'll, do, .]
[Those'll, do, .]
[These'll, do, .]
Is this a training issue or is it an expected result?
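One way to narrow this down is to ask the tokenizer which rule produced each token. spaCy's tokenizer is rule-based, so the split does not come from the statistical model. The sketch below assumes a spaCy version that provides `Tokenizer.explain`, and it uses `spacy.blank("en")` instead of `en_core_web_sm` on the assumption that both share the English tokenizer rules.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because tokenization
# is rule-based and does not depend on the trained pipeline components.
nlp = spacy.blank("en")

for text in ["That'll do.", "This'll do."]:
    # Each pair is (name of the rule that produced the token, token text).
    print(text, nlp.tokenizer.explain(text))
```

If the first string shows special-case entries for its contraction and the second does not, the difference comes from the tokenizer exceptions rather than from training.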
Your Environment
- spaCy version: 2.2.4
- Platform: Linux-4.15.0-101-generic-x86_64-with-glibc2.27
- Python version: 3.8.0
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (7 by maintainers)
Top GitHub Comments
I just sent two pull requests: the first one’s related to this issue, and another one adding a “c’mon” token exception. I hope everything will work just fine. Thanks for the help!
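For anyone who needs the split before a fixed release is installed, a token exception can also be added locally with `Tokenizer.add_special_case`. This is only a minimal sketch: the word list is a hypothetical local patch, not the contents of the pull requests, and `spacy.blank("en")` is assumed to stand in for the loaded pipeline.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline stands in for en_core_web_sm here.
nlp = spacy.blank("en")

# Hypothetical list of words to patch locally; not taken from the pull request.
for word in ["This", "Those", "These"]:
    for form in (word, word.lower()):
        # The ORTH values must concatenate to exactly the original string.
        nlp.tokenizer.add_special_case(
            form + "'ll", [{ORTH: form}, {ORTH: "'ll"}]
        )

for text in ["This'll do.", "Those'll do.", "These'll do."]:
    print([token.text for token in nlp(text)])
```

The same calls work on a pipeline loaded with `spacy.load`, since the special cases are registered on `nlp.tokenizer`.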