
Tokenizer inconsistencies

See original GitHub issue

Hi! I’m seeing a strange inconsistency when tokenizing contractions. All four strings below should be tokenized the same way, with the contraction split into two tokens, but only the first one is. Here’s a simplified example:

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_sm')

examples = [ 
    "That'll do.",
    "This'll do.",
    "Those'll do.", 
    "These'll do.",
]
for example in examples:
    print(list(nlp(example)))

Prints out:

[That, 'll, do, .] 
[This'll, do, .]
[Those'll, do, .]
[These'll, do, .]

Is this a training issue or is it an expected result?
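
For context, spaCy’s English tokenizer splits contractions via a hard-coded exceptions table rather than a trained model, so a contraction missing from the table is simply left unsplit. Until a fix lands upstream, a missing entry can be registered at runtime with Tokenizer.add_special_case; a minimal sketch (the ORTH values of the sub-tokens must concatenate back to the original string):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')

# Register the missing contractions as tokenizer special cases.
for word in ["This", "Those", "These"]:
    nlp.tokenizer.add_special_case(
        word + "'ll", [{ORTH: word}, {ORTH: "'ll"}]
    )

print(list(nlp("This'll do.")))  # [This, 'll, do, .]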

Your Environment

  • spaCy version: 2.2.4
  • Platform: Linux-4.15.0-101-generic-x86_64-with-glibc2.27
  • Python version: 3.8.0

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
jonesmartins commented, Jun 10, 2020

I just sent two pull requests: the first one is related to this issue, and the second adds a “c’mon” token exception. I hope everything works out fine. Thanks for the help!
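
For anyone verifying whether a given contraction is covered, Tokenizer.explain (added in spaCy v2.2.3) reports which rule produced each token; splits that come from the exceptions table are labelled SPECIAL-1, SPECIAL-2, and so on. A quick sketch:

import spacy

nlp = spacy.load('en_core_web_sm')

# Print the rule behind each token; exception-table splits show up
# with SPECIAL-* keys, regular tokens with TOKEN.
for text in ["That'll do.", "This'll do."]:
    print(text, nlp.tokenizer.explain(text))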

0 reactions
github-actions[bot] commented, Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Read more comments on GitHub >

Top Results From Across the Web

Inconsistencies and possible bugs in different tokenizers #3788
We use the tokenizers to encode strings with special tokens. However, we note some inconsistencies: (1) Most, but not all encodings do not ......
Read more >
Inconsistencies between BERT and RoBERTa: what am I ...
Hello, I was trying to have a very rapid and brief test with a simple pipeline that I got from the HuggingFace's course....
Read more >
python - Inconsistent vector representation using transformers ...
I have a BertTokenizer ( tokenizer ) and a BertModel ( model ) from the transformers library. I have pre-trained the model from...
Read more >
How to Build a WordPiece Tokenizer For BERT
The first step for many in designing a new BERT model is the tokenizer. In this article, we'll look at the WordPiece tokenizer...
Read more >
Weaknesses of WordPiece Tokenization | by Rick Battle
... significant performance improvement on our internal benchmarks with no model changes — solely by correcting the tokenization errors ...
Read more >
