using "noun_chunks" from custom extension
I wanted to use pytextrank together with spacy_udpipe to get keywords from texts in other languages (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text), but I realized that spacy-udpipe somehow "overrides" spaCy's original pipeline, so the noun_chunks are not generated. (Btw: the noun_chunks are created in lang/en/syntax_iterators.py, but that file doesn't exist for all languages, so even when it is called it doesn't work, e.g. for Slovak.)
Pytextrank takes its keyword candidates from spaCy's doc.noun_chunks, so if the noun_chunks are not generated, pytextrank doesn't work.
Sample code:
import spacy_udpipe, spacy, pytextrank

spacy_udpipe.download("en")  # download English model

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# using spacy_udpipe
nlp_udpipe = spacy_udpipe.load("en")
tr = pytextrank.TextRank(logger=None)
nlp_udpipe.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc_udpipe = nlp_udpipe(text)

print("keywords from udpipe processing:")
for phrase in doc_udpipe._.phrases:
    print("{:.4f} {:5d} {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

# loading original spacy model
nlp_spacy = spacy.load("en_core_web_sm")
tr2 = pytextrank.TextRank(logger=None)
nlp_spacy.add_pipe(tr2.PipelineComponent, name="textrank", last=True)
doc_spacy = nlp_spacy(text)

print("keywords from spacy processing:")
for phrase in doc_spacy._.phrases:
    print("{:.4f} {:5d} {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)
Would it be possible for pytextrank to take the "noun_chunks" (candidates for keywords) from a custom extension, i.e. a function which uses a Matcher and makes the result available e.g. as doc._.custom_noun_chunks (see https://github.com/explosion/spaCy/issues/3856)?
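For illustration, a minimal sketch of what such an extension could look like, assuming the spaCy v2 Matcher API (to match the sample code above); the helper name add_custom_noun_chunks, the extension name custom_noun_chunks, and the adjective+noun pattern are all hypothetical and would need adjusting per language:

from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

def add_custom_noun_chunks(nlp):
    # hypothetical pattern: optional adjectives followed by one or more nouns
    matcher = Matcher(nlp.vocab)
    matcher.add("NOUN_PHRASE", None, [{"POS": "ADJ", "OP": "*"}, {"POS": "NOUN", "OP": "+"}])

    def get_chunks(doc):
        spans = [doc[start:end] for _, start, end in matcher(doc)]
        return filter_spans(spans)  # drop overlapping matches, keep the longest

    Doc.set_extension("custom_noun_chunks", getter=get_chunks, force=True)

# usage: add_custom_noun_chunks(nlp_udpipe); nlp_udpipe(text)._.custom_noun_chunks

Note that pytextrank itself still reads doc.noun_chunks, so this only produces the candidate spans; whether pytextrank can be pointed at such an extension is exactly what the question above asks.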
Top GitHub Comments
@ceteri You can find the explanation for only ['textrank'] showing up in nlp.pipe_names at https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/language.py#L75-L76.

@fukidzon @ceteri Regarding the doc.noun_chunks property, it is built from a dependency-parsed document (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L577). If you peek a little deeper into the spaCy source code, you'll notice that some languages have an implementation of a proper syntax iterator (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L206, e.g. English: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7) and others don't. The idea behind spacy-udpipe is to be a lightweight wrapper around the underlying UDPipe models. Since the dependency labels used in both spaCy and spacy-udpipe follow the UD scheme for languages other than English and German, I believe the only thing required for doc.noun_chunks to work is a proper syntax iterator implementation. Taking all of this into account, I suggest you either try the approach using doc._.custom_noun_chunks or try implementing the syntax iterator for your language. Hope this helps to solve your issue! 😃
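For reference, a syntax iterator is just a generator that yields (start, end, label) triples over a Doc or Span. A minimal sketch modelled on the English implementation linked above might look like the following; the set of UD dependency labels is an assumption and would need tuning for a specific language:

from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # assumed UD labels for nominal heads; adjust for your language
    labels = ["nsubj", "obj", "iobj", "obl", "nmod", "appos", "ROOT"]
    doc = doclike.doc
    np_deps = {doc.vocab.strings.add(label) for label in labels}
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for word in doclike:
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        if word.left_edge.i <= prev_end:
            continue  # already covered by a previous chunk
        if word.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label

In the built-in languages, a function like this is registered on the language's Defaults as syntax_iterators = {"noun_chunks": noun_chunks}, which is what makes doc.noun_chunks work.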
Hi, I'm trying to use nlp.Defaults.syntax_iterators with spaCy v3, with no success. My language (pt) does not have a syntax_iterators.py file in the spaCy lang folder. Does this only work with spacy_udpipe (which I'm not using)?
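One approach that may be worth trying (a sketch, not a verified answer): in spaCy v3 the syntax iterator is read from the language class's Defaults when the vocab is created, so assigning it before spacy.load might be enough. The bare-bones iterator below is an assumption, and pt_core_news_sm is used only as an example pipeline:

import spacy
from spacy.lang.pt import Portuguese
from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # bare-bones iterator: one chunk per nominal token, from its left edge to the token
    doc = doclike.doc
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for word in doclike:
        if word.pos in (NOUN, PROPN, PRON) and word.left_edge.i > prev_end:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label

# register on the language class before loading, so the vocab picks it up
Portuguese.Defaults.syntax_iterators = {"noun_chunks": noun_chunks}

nlp = spacy.load("pt_core_news_sm")
doc = nlp("Critérios de compatibilidade de um sistema de equações lineares.")
print(list(doc.noun_chunks))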