using "noun_chunks" from custom extension
I wanted to use pytextrank together with spacy_udpipe to get keywords from texts in other languages (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text), but I realized that spacy-udpipe somehow "overrides" spaCy's original pipeline, so the noun_chunks are not generated. (Btw: the noun_chunks are created in lang/en/syntax_iterators.py, but that file doesn't exist for all languages, so even when it is called it doesn't work, e.g. for Slovak.)
Pytextrank takes its keyword candidates from spaCy's doc.noun_chunks, so if the noun_chunks are not generated, pytextrank doesn't work.
Sample code:
import spacy_udpipe, spacy, pytextrank

spacy_udpipe.download("en")  # download English model

text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# using spacy_udpipe
nlp_udpipe = spacy_udpipe.load("en")
tr = pytextrank.TextRank(logger=None)
nlp_udpipe.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc_udpipe = nlp_udpipe(text)

print("keywords from udpipe processing:")
for phrase in doc_udpipe._.phrases:
    print("{:.4f} {:5d} {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

# loading original spacy model
nlp_spacy = spacy.load("en_core_web_sm")
tr2 = pytextrank.TextRank(logger=None)
nlp_spacy.add_pipe(tr2.PipelineComponent, name="textrank", last=True)
doc_spacy = nlp_spacy(text)

print("keywords from spacy processing:")
for phrase in doc_spacy._.phrases:
    print("{:.4f} {:5d} {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)
Would it be possible for pytextrank to take the "noun_chunks" (candidates for keywords) from a custom extension, i.e. a function which uses a Matcher and makes the result available e.g. as doc._.custom_noun_chunks (see https://github.com/explosion/spaCy/issues/3856)?
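For illustration, a minimal sketch of what such an extension could look like, assuming the spaCy v2 Matcher API (to match the sample code above); the helper name add_custom_noun_chunks, the extension name custom_noun_chunks, and the adjective+noun pattern are all hypothetical and would need adjusting per language:

from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

def add_custom_noun_chunks(nlp):
    # hypothetical pattern: optional adjectives followed by one or more nouns
    matcher = Matcher(nlp.vocab)
    matcher.add("NOUN_PHRASE", None, [{"POS": "ADJ", "OP": "*"}, {"POS": "NOUN", "OP": "+"}])

    def get_chunks(doc):
        spans = [doc[start:end] for _, start, end in matcher(doc)]
        return filter_spans(spans)  # drop overlapping matches, keep the longest

    Doc.set_extension("custom_noun_chunks", getter=get_chunks, force=True)

# usage: add_custom_noun_chunks(nlp_udpipe); nlp_udpipe(text)._.custom_noun_chunks

Note that pytextrank itself still reads doc.noun_chunks, so this only produces the candidate spans; whether pytextrank can be pointed at such an extension is exactly what the question above asks.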
Top GitHub Comments
@ceteri You can find the explanation for only ['textrank'] showing up in nlp.pipe_names at https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/language.py#L75-L76.

@fukidzon @ceteri Regarding the doc.noun_chunks property, it is built from a dependency-parsed document (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L577). If you peek a little deeper into the spaCy source code, you'll notice that some languages have an implementation of a proper syntax iterator (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L206, e.g. English: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7) and others don't. The idea behind spacy-udpipe is to be a lightweight wrapper around the underlying UDPipe models. Since the dependency labels used in both spaCy and spacy-udpipe follow the UD scheme for languages other than English and German, I believe the only thing required for doc.noun_chunks to work is a proper syntax iterator implementation. Taking all of this into account, I suggest you either try the approach using doc._.custom_noun_chunks or try implementing the syntax iterator for your language. Hope this helps to solve your issue! 😃
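For reference, a syntax iterator is just a generator that yields (start, end, label) triples over a Doc or Span. A minimal sketch modelled on the English implementation linked above might look like the following; the set of UD dependency labels is an assumption and would need tuning for a specific language:

from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # assumed UD labels for nominal heads; adjust for your language
    labels = ["nsubj", "obj", "iobj", "obl", "nmod", "appos", "ROOT"]
    doc = doclike.doc
    np_deps = {doc.vocab.strings.add(label) for label in labels}
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for word in doclike:
        if word.pos not in (NOUN, PROPN, PRON):
            continue
        if word.left_edge.i <= prev_end:
            continue  # already covered by a previous chunk
        if word.dep in np_deps:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label

In the built-in languages, a function like this is registered on the language's Defaults as syntax_iterators = {"noun_chunks": noun_chunks}, which is what makes doc.noun_chunks work.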
Hi, I'm trying to use nlp.Defaults.syntax_iterators with spaCy v3, with no success. My language (pt) does not have a syntax_iterators.py file in the spaCy lang folder. Does this only work with spacy_udpipe (which I'm not using)?
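One approach that may be worth trying (a sketch, not a verified answer): in spaCy v3 the syntax iterator is read from the language class's Defaults when the vocab is created, so assigning it before spacy.load might be enough. The bare-bones iterator below is an assumption, and pt_core_news_sm is used only as an example pipeline:

import spacy
from spacy.lang.pt import Portuguese
from spacy.symbols import NOUN, PROPN, PRON

def noun_chunks(doclike):
    # bare-bones iterator: one chunk per nominal token, from its left edge to the token
    doc = doclike.doc
    np_label = doc.vocab.strings.add("NP")
    prev_end = -1
    for word in doclike:
        if word.pos in (NOUN, PROPN, PRON) and word.left_edge.i > prev_end:
            prev_end = word.i
            yield word.left_edge.i, word.i + 1, np_label

# register on the language class before loading, so the vocab picks it up
Portuguese.Defaults.syntax_iterators = {"noun_chunks": noun_chunks}

nlp = spacy.load("pt_core_news_sm")
doc = nlp("Critérios de compatibilidade de um sistema de equações lineares.")
print(list(doc.noun_chunks))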