Overlapping multi-word phrases break text representation

When two bigrams from the gazetteer overlap in the text, the text representation of the first one is broken. Consider the following code:

import spacy.en
import spacy.matcher
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63

patterns = [
    [{LOWER: 'food'}, {LOWER: 'safety'}],
    [{LOWER: 'safety'}, {LOWER: 'standards'}],
]

nlp = spacy.en.English(tagger=False, parser=False, load_vectors=False)

nlp.matcher.add('FOOD', 'FOOD', {}, patterns)

docs = nlp('There are different food safety standards in different countries.')

for e in docs.ents:
    print(e.text, e.label_)

Output:

>>> food FOOD
>>> safety standards FOOD
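
To make the overlap concrete, here is a minimal pure-Python sketch, not spaCy's matcher or its internals. It finds both bigrams as token spans, shows that they share the token "safety", and then keeps only non-overlapping spans. The helper names `find_phrase_matches` and `drop_overlaps`, the whitespace tokenization, and the "longest span first, then earliest" rule are all assumptions made for illustration.

```python
def find_phrase_matches(tokens, patterns):
    """Return (start, end) token spans (end exclusive) where a pattern matches."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for start in range(len(lowered)):
        for pattern in patterns:
            end = start + len(pattern)
            # A slice that runs past the end is shorter than the pattern,
            # so it simply fails the comparison.
            if lowered[start:end] == pattern:
                matches.append((start, end))
    return matches


def drop_overlaps(spans):
    """Prefer longer spans, then earlier ones; skip spans that reuse a token."""
    kept, used = [], set()
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if used.isdisjoint(range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)


# Hypothetical whitespace tokenization of the sentence from the report.
tokens = "There are different food safety standards in different countries .".split()
patterns = [["food", "safety"], ["safety", "standards"]]

matches = find_phrase_matches(tokens, patterns)
resolved = drop_overlaps(matches)

for start, end in matches:
    print("match:", " ".join(tokens[start:end]), (start, end))
for start, end in resolved:
    print("kept: ", " ".join(tokens[start:end]), (start, end))
```

Both matches cover token 4 ("safety"), so they cannot both be stored as non-overlapping entity spans. With the rule assumed here, the earlier span "food safety" is kept whole and "safety standards" is dropped, rather than the first span being cut down to "food" as in the output above.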

Issue Analytics

  • State: closed
  • Created 8 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
honnibal commented, Jan 29, 2016

Thanks! I’ll take a look at this next week.

0 reactions
lock[bot] commented, May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
