
spaCy fails to tokenize the closing parenthesis as a suffix if preceded by 8: "8)"

See original GitHub issue

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("(8)")
[tok.text for tok in doc]

returns: ['(', '8)'] instead of the expected ['(', '8', ')']

Your Environment

  • Operating System: Ubuntu 18.04.4 LTS (Linux)
  • Python Version Used: 3.6.9
  • spaCy Version Used: 2.2.3 (also happens with 2.3.0)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
adrianeboyd commented, Jul 10, 2020

The emoticon list could definitely be shortened.

Two tips:

  1. To see why the tokenizer is tokenizing a particular way, use nlp.tokenizer.explain:

     print(nlp.tokenizer.explain("(8)"))
     # [('PREFIX', '('), ('SPECIAL-1', '8)')]

  2. To remove a special case (special cases are called rules internally):

     nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items() if k != "8)"}
     print(nlp.tokenizer.explain("(8)"))
     # [('PREFIX', '('), ('TOKEN', '8'), ('SUFFIX', ')')]

(I’m recommending doing it this way with a dict comprehension since reassigning the rules property also clears the internal tokenizer cache, which del nlp.tokenizer.rules["8)"] wouldn’t. This is not the best state of affairs, but it’s how it works for now.)
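
If this needs to be done in more than one place, the same pattern can be wrapped in a small helper. This is only a sketch following the reassignment approach recommended above; the helper name remove_special_case is made up for illustration:

import spacy

def remove_special_case(nlp, string):
    # Reassign the rules property rather than using del, so the
    # tokenizer's internal cache is cleared along with the rule.
    nlp.tokenizer.rules = {k: v for k, v in nlp.tokenizer.rules.items() if k != string}

nlp = spacy.load("en_core_web_sm")
remove_special_case(nlp, "8)")
print([tok.text for tok in nlp("(8)")])
# ['(', '8', ')']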

1 reaction
adrianeboyd commented, Jul 13, 2020

Yes, you can use del like normal if you do it before initializing the tokenizer. It’s mainly if you’re editing the settings for a tokenizer that’s already been initialized. If you haven’t tokenized anything yet it’s also fine, but if you’ve already used the model to tokenize a few examples and are modifying it on-the-fly, you can run into some really confusing cache-related behavior.
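
As a minimal sketch of that advice (assuming spaCy 2.x, where nlp.tokenizer.rules exposes the special-case dict directly), deleting the rule right after loading the model, before any text has been tokenized, avoids the cache problem:

import spacy

nlp = spacy.load("en_core_web_sm")
# Nothing has been tokenized yet, so the tokenizer cache is still empty
# and removing the special case in place is safe.
del nlp.tokenizer.rules["8)"]
print([tok.text for tok in nlp("(8)")])
# expected: ['(', '8', ')']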

Read more comments on GitHub >

Top Results From Across the Web

SpaCy Parenthesis tokenization: pairs of (LRB, RRB) not ...
Use a custom tokenizer to add the r'\b\)\b' rule (see this regex demo) to infixes. The regex matches a ) that is...
Read more >
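
As a rough illustration of the custom-infix approach described in that result (a sketch only; the "8)" special case discussed in this issue is matched as a whole token, so infix rules alone may not resolve it):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# Append r'\b\)\b' to the default infix patterns so a ')' sitting between
# word characters can be split off as its own token.
infixes = list(nlp.Defaults.infixes) + [r"\b\)\b"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([tok.text for tok in nlp("word(inside)word")])
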
Linguistic Features · spaCy Usage Documentation
spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more....
Read more >
Tokenization error · Issue #1994 · explosion/spaCy - GitHub
Hi everyone, I'm having issues with the Spanish tokenizer. In particular, it's overly joining characters into single tokens. When the characters ...
Read more >
Natural Language Processing With spaCy in Python
In this step-by-step tutorial, you'll learn how to use spaCy. This free and open-source library for Natural Language Processing (NLP) in Python has...
Read more >
spaCy - Quick Guide - Tutorialspoint
To pre-train the “token to vector (tok2vec)” layer of pipeline components. 7. Init-model: to create a new model directory from raw data. 8. ...
Read more >
