Different sentence spans on Document and Token level

How to reproduce the behaviour

I would like to get the sentence index of a token in a doc. My current workaround takes token.sent and compares that span with the list of sentences from doc.sents.

Issue: in some cases token.sent returns a different sentence span than the corresponding sentence from doc.sents:

import spacy
nlp = spacy.load("en_core_web_md")
text = "Very satisfied!. This product definitely met my expectations. I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light and I have a case on it now anyway), brand new screen with screen protector, and works like new. I have had no problems with it at all. I ordered it and I was scheduled to receive it a week later, but it was in my mailbox four days early. I am extremely satisfied with this product as well as this company. I will probably buy another electronic device from Laptop Angels because they are very trustworthy and honest. If you are looking to buy just an iPhone for a cheaper price than what is in the store, I would tell you to buy it from Laptop Angels. Thank you so much for your honest business. I am a very satisfied customer! :)"

doc = nlp(text)
sentences = list(doc.sents)
token = doc[14]  # "refurbished"
print(token.sent == sentences[2])  # False

print(sentences[2])  # I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light
print(token.sent)  # I ordered a refurbished iPhone 4s and it was exactly like it was described:
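
One way to get a sentence index without relying on token.sent is to look up the token's position among the sentence start offsets. The following is only a minimal sketch, not a fix for the behaviour above: the helper name sentence_index is hypothetical, and the offsets in the example are assumed for illustration rather than taken from a real parse. With spaCy, the start offsets could be collected as [sent.start for sent in doc.sents] and the token position read from token.i.

```python
from bisect import bisect_right


def sentence_index(sentence_starts, token_i):
    """Return the index of the sentence that contains token position token_i.

    sentence_starts must be the sorted token offsets at which each sentence
    begins (hypothetical helper; in spaCy these could come from
    [sent.start for sent in doc.sents]).
    """
    if not sentence_starts or token_i < sentence_starts[0]:
        raise ValueError("token position lies before the first sentence")
    # bisect_right counts how many sentences start at or before token_i;
    # the last of those is the sentence that contains the token.
    return bisect_right(sentence_starts, token_i) - 1


# Assumed start offsets for three sentences, for illustration only.
starts = [0, 4, 11]
print(sentence_index(starts, 14))
```

Because the lookup uses only doc.sents, the result stays consistent with the document-level sentence list even when token.sent disagrees with it.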

Your Environment

Python Version Used: 3.8.2
spaCy Version Used: 2.2.4
Environment Information: Docker image python:3 (Linux)

Issue Analytics

Created 3 years ago
Reactions: 1
Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction

tobiasblasberg commented, May 14, 2020

Here is another strange behavior of token.sent, where the token is not always part of the span returned by token.sent:

for token in doc:
    print(token.sent, token)

Part of the output:

…
I ordered a refurbished iPhone 4s and it was exactly like it was described: I
I ordered a refurbished iPhone 4s and it was exactly like it was described: ordered
I ordered a refurbished iPhone 4s and it was exactly like it was described: a
I ordered a refurbished iPhone 4s and it was exactly like it was described: refurbished
I ordered a refurbished iPhone 4s and it was exactly like it was described: iPhone
I ordered a refurbished iPhone 4s and it was exactly like it was described: 4s
I ordered a refurbished iPhone 4s and it was exactly like it was described: and
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: exactly
I ordered a refurbished iPhone 4s and it was exactly like it was described: like
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: described
I ordered a refurbished iPhone 4s and it was exactly like it was described: :
I ordered a refurbished iPhone 4s and it was exactly like it was described: minor
I ordered a refurbished iPhone 4s and it was exactly like it was described: scratches
I ordered a refurbished iPhone 4s and it was exactly like it was described: on
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: back
I ordered a refurbished iPhone 4s and it was exactly like it was described: (
I ordered a refurbished iPhone 4s and it was exactly like it was described: you
I ordered a refurbished iPhone 4s and it was exactly like it was described: can
I ordered a refurbished iPhone 4s and it was exactly like it was described: not
I ordered a refurbished iPhone 4s and it was exactly like it was described: see
I ordered a refurbished iPhone 4s and it was exactly like it was described: them
I ordered a refurbished iPhone 4s and it was exactly like it was described: unless
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: has
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: right
I ordered a refurbished iPhone 4s and it was exactly like it was described: kind
I ordered a refurbished iPhone 4s and it was exactly like it was described: of
I ordered a refurbished iPhone 4s and it was exactly like it was described: light
…

0 reactions

github-actions[bot] commented, Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
