Adding multilingual support with spaCy models
Idea
Using spaCy as core NLP library for pytextrank opens up the possibility of supporting new languages other than English.
Initial analysis
As of spaCy 1.8.x, four languages are officially supported: en, de, fr and, more recently, es.
I have performed an initial analysis and testing with two new languages: (1) German and (2) Spanish.
Of course, as with the English models, the user would need to run python -m spacy download de or python -m spacy download es.
Based on my local tests running the example notebook for German and Spanish, pytextrank would need the following to support a new language:
- Make lang configurable and have https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L187 load the language identified by its ISO code.
- [CAVEAT] If the language is available in spaCy but lacks any of the required features ((1) POS, (2) NER, and (3) a noun chunking method; anything else?), pytextrank should show a warning/error message, e.g., here: https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L423 for noun_chunking, or here: https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L480 for NER.
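To make the two points above more concrete, here is a minimal sketch of what I have in mind. It is not pytextrank's actual code: the names load_language, probe_features and REQUIRED_FEATURES are hypothetical, and it assumes spacy.load accepts the ISO code, as in the 1.8.x releases discussed here. The feature check is only a heuristic: it parses a sample sentence (which should be in the target language and contain a noun phrase and a named entity) and treats a feature as missing if it yields nothing.

```python
import warnings

# Features named in this issue as required by pytextrank (hypothetical constant).
REQUIRED_FEATURES = ("pos", "ner", "noun_chunks")


def probe_features(doc):
    """Return the set of required features that show up in a parsed sample.

    Heuristic: a feature counts as present only if it produces at least
    one result on the sample document.
    """
    found = set()
    if any(token.pos_ for token in doc):
        found.add("pos")
    if len(list(doc.ents)) > 0:
        found.add("ner")
    try:
        if len(list(doc.noun_chunks)) > 0:
            found.add("noun_chunks")
    except (NotImplementedError, ValueError):
        # Assumption: depending on the spaCy version, a missing noun chunker
        # may raise instead of yielding nothing; both count as "missing".
        pass
    return found


def load_language(lang, sample, loader=None):
    """Load the pipeline for an ISO code and warn about missing features.

    `loader` defaults to spacy.load; it is a parameter so that the sketch
    can be exercised with a stub instead of a downloaded model.
    """
    if loader is None:
        import spacy  # imported lazily, only needed for the real loader
        loader = spacy.load
    nlp = loader(lang)
    found = probe_features(nlp(sample))
    missing = [name for name in REQUIRED_FEATURES if name not in found]
    if missing:
        warnings.warn(
            "spaCy model %r seems to lack: %s" % (lang, ", ".join(missing))
        )
    return nlp, missing
```

With this, a call such as load_language("es", "Madrid es la capital de España.") would return the pipeline together with the list of features the probe could not find, and pytextrank could decide whether to continue or fail based on that list.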
Current status
Of the other 3 official languages, one would be supported out of the box (German) and another would be supported in the next release: Spanish only lacks noun_chunking in 1.8.2, which is already implemented on the master branch and should in principle be included in the next release (see https://github.com/explosion/spaCy/pull/1096/commits/5b385e7d78fd955d97b59024645d2592bdbc0949). French would still need NER and noun_chunking to be implemented.
I would be happy to contribute code and examples if needed 😃
Dani
Issue Analytics
- Created 6 years ago
- Reactions:4
- Comments:6
Top GitHub Comments
Is there a tutorial for how to use it with German already somewhere?
@ceteri Thank you!