Different sentence spans on Document and Token level

How to reproduce the behaviour

I would like to extract the sentence index of a token in a doc. My current workaround takes token.sent and compares that span against the list of sentences from doc.sents.

Issue: in some cases, token.sent returns a different sentence span than the corresponding sentence in doc.sents:

import spacy
nlp = spacy.load("en_core_web_md")
text = "Very satisfied!. This product definitely met my expectations. I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light and I have a case on it now anyway), brand new screen with screen protector, and works like new. I have had no problems with it at all. I ordered it and I was scheduled to receive it a week later, but it was in my mailbox four days early. I am extremely satisfied with this product as well as this company. I will probably buy another electronic device from Laptop Angels because they are very trustworthy and honest. If you are looking to buy just an iPhone for a cheaper price than what is in the store, I would tell you to buy it from Laptop Angels. Thank you so much for your honest business. I am a very satisfied customer! :)"

doc = nlp(text)
sentences = list(doc.sents)
token = doc[14]  # "refurbished"
token.sent == sentences[2]  # False

sentences[2]  # I ordered a refurbished iPhone 4s and it was exactly like it was described: minor scratches on the back (you can not see them unless it has the right kind of light
token.sent  # I ordered a refurbished iPhone 4s and it was exactly like it was described:
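
One way to get a sentence index without comparing spans at all is to look the token's position up in the list of sentence start offsets. The sketch below is only an illustration, not something from the thread: sentence_index is a hypothetical helper, and the offsets in the example are made up for demonstration. With spaCy, the offsets could be built as [sent.start for sent in doc.sents] and the position taken from token.i.

```python
from bisect import bisect_right


def sentence_index(sent_starts, token_i):
    """Return the index of the sentence that contains token position token_i.

    sent_starts is a sorted list holding the first token offset of each
    sentence, e.g. [sent.start for sent in doc.sents] for a spaCy Doc.
    (Hypothetical helper, not part of spaCy.)
    """
    if not sent_starts or token_i < sent_starts[0]:
        raise ValueError("token index precedes the first sentence")
    # bisect_right counts how many sentences start at or before token_i.
    return bisect_right(sent_starts, token_i) - 1


# Illustrative offsets only; real values depend on the model's segmentation.
starts = [0, 4, 11, 60]
idx = sentence_index(starts, 14)  # position 14 falls in the sentence starting at 11
```

Because this relies only on doc.sents and token.i, the result stays consistent with doc.sents even when token.sent reports a different span.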

Your Environment

  • Python Version Used: 3.8.2
  • spaCy Version Used: 2.2.4
  • Environment Information: Docker image python:3 (Linux)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
tobiasblasberg commented, May 14, 2020

Here is another strange behavior of token.sent: for some tokens, the span returned by token.sent does not contain the token itself:

for token in doc:
    print(token.sent, token) 

Part of the output:

…
I ordered a refurbished iPhone 4s and it was exactly like it was described: I
I ordered a refurbished iPhone 4s and it was exactly like it was described: ordered
I ordered a refurbished iPhone 4s and it was exactly like it was described: a
I ordered a refurbished iPhone 4s and it was exactly like it was described: refurbished
I ordered a refurbished iPhone 4s and it was exactly like it was described: iPhone
I ordered a refurbished iPhone 4s and it was exactly like it was described: 4s
I ordered a refurbished iPhone 4s and it was exactly like it was described: and
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: exactly
I ordered a refurbished iPhone 4s and it was exactly like it was described: like
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: was
I ordered a refurbished iPhone 4s and it was exactly like it was described: described
I ordered a refurbished iPhone 4s and it was exactly like it was described: :
I ordered a refurbished iPhone 4s and it was exactly like it was described: minor
I ordered a refurbished iPhone 4s and it was exactly like it was described: scratches
I ordered a refurbished iPhone 4s and it was exactly like it was described: on
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: back
I ordered a refurbished iPhone 4s and it was exactly like it was described: (
I ordered a refurbished iPhone 4s and it was exactly like it was described: you
I ordered a refurbished iPhone 4s and it was exactly like it was described: can
I ordered a refurbished iPhone 4s and it was exactly like it was described: not
I ordered a refurbished iPhone 4s and it was exactly like it was described: see
I ordered a refurbished iPhone 4s and it was exactly like it was described: them
I ordered a refurbished iPhone 4s and it was exactly like it was described: unless
I ordered a refurbished iPhone 4s and it was exactly like it was described: it
I ordered a refurbished iPhone 4s and it was exactly like it was described: has
I ordered a refurbished iPhone 4s and it was exactly like it was described: the
I ordered a refurbished iPhone 4s and it was exactly like it was described: right
I ordered a refurbished iPhone 4s and it was exactly like it was described: kind
I ordered a refurbished iPhone 4s and it was exactly like it was described: of
I ordered a refurbished iPhone 4s and it was exactly like it was described: light
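
The output above can be condensed into a check that lists only the tokens whose reported sentence does not cover them. This is a rough sketch under assumptions: tokens_outside_own_sentence is a hypothetical helper, it relies only on the attributes token.i, token.sent.start and token.sent.end, and the demo uses stand-in objects with illustrative offsets rather than a real spaCy Doc. On a parsed document it could presumably be called as tokens_outside_own_sentence(doc).

```python
from types import SimpleNamespace


def tokens_outside_own_sentence(tokens):
    """Collect tokens whose token.sent span does not cover the token itself.

    Each item needs an integer .i and a .sent with .start and .end
    (end exclusive), which is what spaCy tokens and spans provide.
    (Hypothetical helper, not part of spaCy.)
    """
    mismatches = []
    for token in tokens:
        sent = token.sent
        if not (sent.start <= token.i < sent.end):
            mismatches.append((token.i, sent.start, sent.end))
    return mismatches


# Stand-in objects with illustrative offsets: the reported sentence covers
# positions 11..25, but the token at position 26 still points to it.
reported = SimpleNamespace(start=11, end=26)
fake_tokens = [
    SimpleNamespace(i=14, sent=reported),  # inside the reported span
    SimpleNamespace(i=26, sent=reported),  # outside the reported span
]
mismatches = tokens_outside_own_sentence(fake_tokens)
```

An empty result would mean every token lies inside the span its token.sent returns.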

0 reactions
github-actions[bot] commented, Nov 5, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
