[Feature Request] Allow using custom languages/models for spaCy NLP

Is your feature request related to a problem? Please describe.

Other related issues: #408 #251

I trained a Chinese model for spaCy, linked it to [spacy's package folder]/data/zh (using spacy link), and want to use it with Ludwig. However, when I tried to set it in the Ludwig config, I received the following error, which tells me that there is no way to load the Chinese model.

ValueError: Key chinese_tokenizer not supported, available options: dict_keys(['characters', 'space', 'space_punct', 'underscore', 'comma', 'untokenized' (...) 'bert'])

Describe the use case

Allowing custom languages for spaCy would let users who work in other languages process their text more quickly and easily.

Describe the solution you’d like

Here’s the current solution…

input_features:
  -
    name: input
    type: text
    preprocessing:
      word_tokenizer: english_tokenize

…which I think could be changed to this…

input_features:
  -
    name: input
    type: text
    preprocessing:
      word_tokenizer: spacy_tokenize
      spacy_model: zh #(or en, xx, etc.)
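
As a rough sketch of how the proposed option could be interpreted (this is not Ludwig's code; the function name `resolve_spacy_model` and the `xx` fallback are assumptions made up for illustration), a loader might read the `preprocessing` section like this:

```python
# Hypothetical sketch of reading the proposed `spacy_model` key.
# `resolve_spacy_model` is an illustrative name, not part of Ludwig.

DEFAULT_SPACY_MODEL = "xx"  # assumed fallback: spaCy's multi-language code


def resolve_spacy_model(preprocessing):
    """Return the spaCy model name to load, or None if spaCy is not requested."""
    if preprocessing.get("word_tokenizer") != "spacy_tokenize":
        return None
    return preprocessing.get("spacy_model", DEFAULT_SPACY_MODEL)


# Python equivalent of the proposed YAML input feature
feature = {
    "name": "input",
    "type": "text",
    "preprocessing": {"word_tokenizer": "spacy_tokenize", "spacy_model": "zh"},
}

model_name = resolve_spacy_model(feature["preprocessing"])
print(model_name)
```

In spaCy 2.x, a linked name such as `zh` could then be passed to `spacy.load` to obtain the pipeline.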

Describe alternatives you’ve considered

I’ve considered not using spaCy and instead writing a custom script that simply splits sentences into words with a segmenter such as “jieba”. However, with this approach I would lose nearly all of the NLP benefits that spaCy provides.

Additional context

I think that’s all 😃 I don’t know whether my suggestion will be accepted, but I would be very thankful if this were solved. BTW, since I’m not a native English speaker, there may be some mistakes. Please don’t mind them :p

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:10

Top GitHub Comments

w4nderlust commented, Mar 6, 2020 (2 reactions)

We are thinking about changing the way you define the tokenizer to be more flexible, and that would allow you to do what you are looking for.

In the meantime, if you are using the API, you can do the following:

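from ludwig.utils.nlp_utils import language_module_registry
# Assumption: these two helpers live in nlp_utils alongside the registry
from ludwig.utils.nlp_utils import load_nlp_pipeline, process_text
from ludwig.utils.strings_utils import tokenizer_registry
from ludwig.utils.strings_utils import BaseTokenizer

# Map the language code to the name of your spaCy model
language_module_registry['zh'] = 'your_model_name'  # for example 'en_core_web_sm'

class ChineseTokenizer(BaseTokenizer):
    def __call__(self, text):
        # Tokenize with the spaCy pipeline registered for 'zh'
        return process_text(text, load_nlp_pipeline('zh'))

# Expose the tokenizer under the name used in the model configuration
tokenizer_registry['chinese_tokenizer'] = ChineseTokenizer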

After you do this, you can refer to chinese_tokenizer in your model configuration within the same script where you run the code above.
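
To make the mechanism concrete, here is a self-contained toy version of the registry pattern that does not import Ludwig. The names `toy_tokenizer_registry`, `get_tokenizer`, `WhitespaceTokenizer`, and `CharTokenizer` are made up for this sketch, and the error text only imitates the message quoted in the issue; Ludwig's real lookup may differ.

```python
# Toy stand-in for a tokenizer registry (assumption: not Ludwig's actual code).


class BaseTokenizer:
    def __call__(self, text):
        raise NotImplementedError


class WhitespaceTokenizer(BaseTokenizer):
    def __call__(self, text):
        return text.split()


# Maps the name used in the configuration to a tokenizer class
toy_tokenizer_registry = {"space": WhitespaceTokenizer}


def get_tokenizer(name):
    """Look up a tokenizer by name; unknown names raise a ValueError."""
    try:
        return toy_tokenizer_registry[name]()
    except KeyError:
        raise ValueError(
            "Key {} not supported, available options: {}".format(
                name, toy_tokenizer_registry.keys()
            )
        ) from None


# Before registration, the lookup fails in the same way as in the issue
try:
    get_tokenizer("chinese_tokenizer")
    failed = False
except ValueError:
    failed = True


class CharTokenizer(BaseTokenizer):
    # Placeholder logic: splits into characters; a real tokenizer would call spaCy
    def __call__(self, text):
        return [ch for ch in text if not ch.isspace()]


# After registration, the same name resolves
toy_tokenizer_registry["chinese_tokenizer"] = CharTokenizer
tokens = get_tokenizer("chinese_tokenizer")("我爱 自然语言")
print(failed, tokens)
```

Because a registry like this is an in-process dictionary, the registration has to run in the same Python process before the configuration is resolved, which is presumably why the workaround is described for the API and for the same script.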

brightsparc commented, Nov 18, 2020 (1 reaction)

Okay thanks @ANarayan this makes sense. @w4nderlust I have created a PR #1012 for this fix and have tested with my fork.

Also, FYI, I am working on an AWS SageMaker example that I should be able to share with the community shortly.
