spaCy fails to tokenize the closing parenthesis as a suffix if preceded by 8: "8)"
How to reproduce the behaviour
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("(8)")
[tok.text for tok in doc]
returns:
['(', '8)']
but the expected output is ['(', '8', ')'], with the closing parenthesis split off as a suffix.
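A quick way to see why this happens is nlp.tokenizer.explain, which the maintainers also point to below. A minimal sketch, assuming the same en_core_web_sm model as above:

import spacy

nlp = spacy.load('en_core_web_sm')
# Tokenizer.explain lists (rule, substring) pairs for a text, so you can
# see which rule produced each piece of "(8)".
for rule, substring in nlp.tokenizer.explain("(8)"):
    print(rule, repr(substring))
# "(" is split off by the prefix rules, while "8)" is kept whole by a
# special-case rule: it is listed in spaCy's built-in emoticon exceptions.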
Your Environment
- Operating System: Ubuntu 18.04.4 LTS (Linux)
- Python Version Used: 3.6.9
- spaCy Version Used: 2.2.3 (but also happens with spaCy 2.3.0)
- Environment Information:
Issue Analytics
- Created: 3 years ago
- Comments: 6 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The emoticon list could definitely be shortened.

Two tips:
- nlp.tokenizer.explain shows which rule produced each token, so you can see that "8)" is kept whole by a special-case rule from the built-in emoticon exceptions (the same list that contains ":(" and similar entries).
- To remove that rule, reassign nlp.tokenizer.rules with a dict comprehension that filters out the "8)" key.

(I'm recommending doing it this way with a dict comprehension, since reassigning the rules property also clears the internal tokenizer cache, which del nlp.tokenizer.rules["8)"] wouldn't. This is not the best state of affairs, but it's how it works for now.)

Yes, you can use del like normal if you do it before initializing the tokenizer. It's mainly if you're editing the settings for a tokenizer that's already been initialized. If you haven't tokenized anything yet it's also fine, but if you've already used the model to tokenize a few examples and are modifying it on-the-fly, you can run into some really confusing cache-related behavior.
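Putting that advice together, a minimal sketch of the workaround, assuming the en_core_web_sm model and a spaCy version from the report (2.2.3 or 2.3.0):

import spacy

nlp = spacy.load('en_core_web_sm')
# Rebuild the rules dict without the "8)" emoticon exception; reassigning
# the property also clears the tokenizer's internal cache, which a plain
# del nlp.tokenizer.rules["8)"] would not.
nlp.tokenizer.rules = {key: value for key, value in nlp.tokenizer.rules.items() if key != "8)"}

print([tok.text for tok in nlp("(8)")])
# expected: ['(', '8', ')']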