Adding multilingual support with spaCy models

Idea

Using spaCy as core NLP library for pytextrank opens up the possibility of supporting new languages other than English.

Initial analysis

As of spaCy 1.8.x, four languages are officially supported: en, de, fr and, more recently, es.

I have performed an initial analysis and testing with two new languages: (1) German and (2) Spanish. Of course, as with the English models, the user would need to run python -m spacy download de or python -m spacy download es.
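Once a model has been downloaded, loading it by language code could look roughly like the sketch below. The helper name load_language_model, the SUPPORTED_LANGS tuple and the injectable loader argument are assumptions made for illustration and are not part of pytextrank; only spacy.load is a real spaCy call. The short codes (de, es) follow the download commands above, whereas newer spaCy releases expect full model package names, and the exact exception raised for a missing model may differ between spaCy versions.

```python
# The four language codes the issue lists for spaCy 1.8.x.
SUPPORTED_LANGS = ("en", "de", "fr", "es")


def load_language_model(lang="en", loader=None):
    """Load the spaCy model for an ISO language code (hypothetical helper)."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError("Unsupported language code: {!r}".format(lang))
    if loader is None:
        # Imported lazily so the helper can be exercised without spaCy installed.
        import spacy
        loader = spacy.load
    try:
        return loader(lang)
    except (OSError, RuntimeError) as err:
        # Assumption: a model that was never downloaded surfaces as one of these.
        raise RuntimeError(
            "No spaCy model found for '{0}'. "
            "Try: python -m spacy download {0}".format(lang)
        ) from err
```

Calling load_language_model("de") would then return the German pipeline, or fail with a message that points the user at the download command.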

According to my local tests executing the example notebook for German and Spanish, the following would be needed in pytextrank to support a new language:

  1. Make lang configurable, so that https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L187 loads the language identified by its ISO code.

  2. [CAVEAT] If a language is available in spaCy but lacks any of the required features, namely (1) POS, (2) NER, and (3) a noun chunking method (anything else?), pytextrank should show a warning/error message, e.g., here: https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L423 for noun_chunking, or here: https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L480 for NER.
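The feature check in point 2 could be sketched as follows. This is only an illustration: missing_features and warn_if_unsupported are hypothetical names, not pytextrank functions. The sketch relies on the spaCy attributes token.pos_, doc.ents and doc.noun_chunks, and it assumes that a language without noun chunking either yields no chunks or raises when noun_chunks is accessed, which may vary between spaCy versions.

```python
import warnings


def missing_features(doc):
    """Return the required features that a parsed probe document lacks."""
    missing = []
    # No token carries a coarse POS tag when the model has no tagger.
    if not any(token.pos_ for token in doc):
        missing.append("POS")
    # The probe text is expected to contain an obvious named entity.
    if not list(doc.ents):
        missing.append("NER")
    # Assumption: an unsupported language either yields no chunks or
    # raises when noun_chunks is accessed, depending on the spaCy version.
    try:
        has_chunks = bool(list(doc.noun_chunks))
    except (NotImplementedError, ValueError):
        has_chunks = False
    if not has_chunks:
        missing.append("noun_chunks")
    return missing


def warn_if_unsupported(doc, lang):
    """Warn about missing features and return their names."""
    missing = missing_features(doc)
    if missing:
        warnings.warn(
            "spaCy model for '{}' lacks: {}".format(lang, ", ".join(missing))
        )
    return missing
```

In practice, doc would be the result of running the loaded model on a short probe sentence in the target language that contains at least one named entity and one noun phrase.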

Current status

Of the other three official languages, one would be supported out of the box (German) and another would be supported in the next release (Spanish only lacks noun_chunking in 1.8.2; it is already implemented on the master branch and should, in principle, be included in the next release, see https://github.com/explosion/spaCy/pull/1096/commits/5b385e7d78fd955d97b59024645d2592bdbc0949). French would still need NER and noun_chunking to be implemented.

I would be happy to contribute code and examples if needed 😃

Dani

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 6

Top GitHub Comments

danielp3011 commented, Nov 2, 2019 (1 reaction)

Is there a tutorial for how to use it with German already somewhere?

danielp3011 commented, Nov 6, 2019 (0 reactions)

@ceteri Thank you!
