
Using "noun_chunks" from a custom extension

See original GitHub issue

I wanted to use pytextrank together with spacy_udpipe to extract keywords from texts in languages other than English (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text), but I realized that spacy_udpipe somehow “overrides” the original spaCy pipeline, so noun_chunks are not generated. (Note: noun_chunks is implemented in lang/en/syntax_iterators.py, but that file doesn’t exist for every language, so even when it is called it doesn’t work for, e.g., Slovak.)

Pytextrank takes its keyword candidates from spaCy’s doc.noun_chunks, so if the noun_chunks are not generated, pytextrank doesn’t work.

Sample code:

import spacy_udpipe, spacy, pytextrank
spacy_udpipe.download("en") # download English model
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# using spacy_udpipe
nlp_udpipe = spacy_udpipe.load("en")
tr = pytextrank.TextRank(logger=None)
nlp_udpipe.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc_udpipe = nlp_udpipe(text)

print("keywords from udpipe processing:")
for phrase in doc_udpipe._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

# loading original spacy model
nlp_spacy = spacy.load("en_core_web_sm")
tr2 = pytextrank.TextRank(logger=None)
nlp_spacy.add_pipe(tr2.PipelineComponent, name="textrank", last=True)
doc_spacy = nlp_spacy(text)

print("keywords from spacy processing:")
for phrase in doc_spacy._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

Would it be possible for pytextrank to take its “noun_chunks” (keyword candidates) from a custom extension, i.e. a function that uses a Matcher and exposes its result as, e.g., doc._.custom_noun_chunks (see https://github.com/explosion/spaCy/issues/3856)?
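For reference, a minimal sketch of what such an extension could look like, using spaCy’s Matcher and Doc.set_extension (spaCy v3 API). The attribute name custom_noun_chunks and the ADJ* NOUN+ pattern are illustrative choices, not anything pytextrank itself defines:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

def get_custom_noun_chunks(doc):
    """Collect Matcher-based noun-phrase candidates for a Doc."""
    matcher = Matcher(doc.vocab)
    # Illustrative pattern: zero or more adjectives followed by one or more nouns
    matcher.add("NP", [[{"POS": "ADJ", "OP": "*"}, {"POS": "NOUN", "OP": "+"}]])
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # Keep only the longest, non-overlapping candidates
    return spacy.util.filter_spans(spans)

Doc.set_extension("custom_noun_chunks", getter=get_custom_noun_chunks, force=True)

# Quick check on a hand-built Doc (no trained model needed; POS tags set manually)
nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["strict", "inequations"], pos=["ADJ", "NOUN"])
print([span.text for span in doc._.custom_noun_chunks])  # → ['strict inequations']
```

A pipeline component that reads doc._.custom_noun_chunks instead of doc.noun_chunks would then work for languages without a syntax iterator.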

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 11
Top GitHub Comments

2 reactions
asajatovic commented on Feb 24, 2020

@ceteri You can find the explanation for why only [‘textrank’] shows up in nlp.pipe_names at https://github.com/TakeLab/spacy-udpipe/blob/master/spacy_udpipe/language.py#L75-L76.

@fukidzon @ceteri Regarding the doc.noun_chunks property: it is built from a dependency-parsed document (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L577). If you peek a little deeper into the spaCy source code, you’ll notice that some languages have a proper syntax-iterator implementation (https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx#L206; e.g. English: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7) and others don’t. The idea behind spacy-udpipe is to be a lightweight wrapper around the underlying UDPipe models. Since the dependency labels used by both spaCy and spacy-udpipe follow the UD scheme for languages other than English and German, I believe the only thing required for doc.noun_chunks to work is a proper syntax-iterator implementation. Taking all of this into account, I suggest you either try the doc._.custom_noun_chunks approach or try implementing the syntax iterator for your language. Hope this helps to solve your issue! 😃
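To make the suggestion concrete, the core logic of such a syntax iterator can be sketched in plain Python. This only illustrates the idea behind lang/en/syntax_iterators.py: it operates on (index, text, UPOS, dep, left-edge) tuples instead of spaCy tokens, and the set of dependency labels is an illustrative, not exhaustive, choice; a real spaCy implementation would iterate over a Doc and yield (token.left_edge.i, token.i + 1, label) triples:

```python
def noun_chunks(tokens):
    """Yield (start, end) half-open spans for base noun phrases.

    `tokens` is a list of (index, text, upos, dep, left_edge_index) tuples;
    the dependency labels below are Universal Dependencies relations that
    typically attach a noun phrase to the rest of the sentence.
    """
    NP_DEPS = {"nsubj", "obj", "iobj", "obl", "nmod", "appos", "conj", "ROOT"}
    prev_end = -1
    for i, text, upos, dep, left_edge in tokens:
        if upos in ("NOUN", "PROPN", "PRON") and dep in NP_DEPS:
            if left_edge <= prev_end:  # already covered by a previous chunk
                continue
            prev_end = i
            yield (left_edge, i + 1)

# "strict inequations are considered" -> one chunk: "strict inequations"
toks = [(0, "strict", "ADJ", "amod", 0),
        (1, "inequations", "NOUN", "nsubj", 0),
        (2, "are", "AUX", "aux", 2),
        (3, "considered", "VERB", "ROOT", 3)]
print(list(noun_chunks(toks)))  # → [(0, 2)]
```

The left-edge index expands each chunk to include the noun’s leftmost dependent (here the adjective “strict”), which is what gives noun_chunks multi-word spans.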

1 reaction
andremacola commented on May 5, 2021

Hi, I’m trying to use nlp.Defaults.syntax_iterators with spaCy v3, with no success. My language (pt) does not have a syntax_iterators.py file in the spaCy lang folder.

Does this only work with spacy_udpipe? I’m not using it.

Read more comments on GitHub >
