
Tokenizer inconsistencies

See original GitHub issue

Hi! I’m seeing a strange inconsistency when tokenizing contractions. All four strings below should be tokenized the same way, with the contraction split into two tokens, but only the first one is. Here’s a simplified example:

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_sm')

examples = [ 
    "That'll do.",
    "This'll do.",
    "Those'll do.", 
    "These'll do.",
]
for example in examples:
    print(list(nlp(example)))

Prints out:

[That, 'll, do, .] 
[This'll, do, .]
[Those'll, do, .]
[These'll, do, .]

Is this a training issue or is it an expected result?
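
For context, spaCy’s English tokenizer splits contractions via a hard-coded exceptions table rather than a trained model, so a contraction missing from the table is simply left unsplit. Until a fix lands upstream, a missing entry can be registered at runtime with Tokenizer.add_special_case; a minimal sketch (the ORTH values of the sub-tokens must concatenate back to the original string):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load('en_core_web_sm')

# Register the missing contractions as tokenizer special cases.
for word in ["This", "Those", "These"]:
    nlp.tokenizer.add_special_case(
        word + "'ll", [{ORTH: word}, {ORTH: "'ll"}]
    )

print(list(nlp("This'll do.")))  # [This, 'll, do, .]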

Your Environment

  • spaCy version: 2.2.4
  • Platform: Linux-4.15.0-101-generic-x86_64-with-glibc2.27
  • Python version: 3.8.0

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
jonesmartins commented, Jun 10, 2020

I just sent two pull requests: the first one is related to this issue, and the second adds a “c’mon” token exception. I hope everything works out fine. Thanks for the help!
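
For anyone verifying whether a given contraction is covered, Tokenizer.explain (added in spaCy v2.2.3) reports which rule produced each token; splits that come from the exceptions table are labelled SPECIAL-1, SPECIAL-2, and so on. A quick sketch:

import spacy

nlp = spacy.load('en_core_web_sm')

# Print the rule behind each token; exception-table splits show up
# with SPECIAL-* keys, regular tokens with TOKEN.
for text in ["That'll do.", "This'll do."]:
    print(text, nlp.tokenizer.explain(text))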

0 reactions
github-actions[bot] commented, Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Read more comments on GitHub >

Top Results From Across the Web

Inconsistencies and possible bugs in different tokenizers #3788
We use the tokenizers to encode strings with special tokens. However, we note some inconsistencies: (1) Most, but not all encodings do not ......
Read more >
Inconsistencies between BERT and RoBERTa: what am I ...
Hello, I was trying to have a very rapid and brief test with a simple pipeline that I got from the HuggingFace's course....
Read more >
python - Inconsistent vector representation using transformers ...
I have a BertTokenizer ( tokenizer ) and a BertModel ( model ) from the transformers library. I have pre-trained the model from...
Read more >
How to Build a WordPiece Tokenizer For BERT
The first step for many in designing a new BERT model is the tokenizer. In this article, we'll look at the WordPiece tokenizer...
Read more >
Weaknesses of WordPiece Tokenization | by Rick Battle
... significant performance improvement on our internal benchmarks with no model changes — solely by correcting the tokenization errors ...
Read more >
