[Feature Request] Allow using custom languages/models for spaCy NLP
**Is your feature request related to a problem? Please describe.**
Other related issues: #408 #251
I trained a Chinese model for spaCy and linked it to [spacy's package folder]/data/zh (using `spacy link`), and I want to use it with Ludwig. However, when I set the Ludwig config, I received an error telling me there is no way to load the Chinese model:
```
ValueError: Key chinese_tokenizer not supported, available options: dict_keys(['characters', 'space', 'space_punct', 'underscore', 'comma', 'untokenized' (...) 'bert'])
```
**Describe the use case**
By allowing custom languages/models for spaCy, users working in other languages could process their texts more quickly and easily.
**Describe the solution you'd like**
Here's the current configuration…
```yaml
input_features:
  - name: input
    type: text
    preprocessing:
      word_tokenizer: english_tokenize
```
…which I think could be changed to this…
```yaml
input_features:
  - name: input
    type: text
    preprocessing:
      word_tokenizer: spacy_tokenize
      spacy_model: zh  # or en, xx, etc.
```
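To illustrate how the proposed `spacy_model` key could work, here is a minimal sketch of a config resolver. The function name and return shape are my own illustration, not Ludwig code; the spaCy import is deferred so the model is only loaded when the tokenizer is actually used:

```python
def resolve_tokenizer(preprocessing):
    """Sketch: map the proposed preprocessing keys to a tokenizer loader.

    Returns (kind, name, loader); `loader` lazily builds the spaCy pipeline.
    """
    tok = preprocessing.get("word_tokenizer")
    if tok == "spacy_tokenize":
        model = preprocessing.get("spacy_model", "xx")  # default: multilingual

        def load():
            import spacy  # deferred: only needed when tokenizing
            nlp = spacy.load(model)  # works for any installed/linked model, e.g. "zh"
            return lambda text: [t.text for t in nlp(text)]

        return ("spacy", model, load)
    # otherwise fall back to the built-in tokenizers ("space", "characters", ...)
    return ("builtin", tok, None)
```

With this shape, adding a new language is just a config change rather than a new registry key per language.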
**Describe alternatives you've considered**
I've considered not using spaCy and instead using a custom script that simply splits sentences into words with a segmenter like jieba. However, with that method I would lose nearly all the benefits of NLP.
**Additional context**
I think that's all 😃 I don't know whether this suggestion will be accepted, but if it were implemented I would be very thankful. BTW, since I'm not a native English speaker, there may be some mistakes; please don't mind them :p
Issue Analytics
- Created 4 years ago
- Comments: 10
Top GitHub Comments
We are thinking about changing the way you define the tokenizer to be more flexible, and that would allow you to do what you are looking for.
In the meantime, if you are using the API, you can do the following:
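The snippet referenced here did not survive on the archived page. As a hedged sketch of the registration pattern (the `tokenizer_registry` dict and `ChineseTokenizer` class below are assumptions about Ludwig's internals at the time, not a verified API, so check against your Ludwig version):

```python
# Hypothetical sketch -- not verified against any specific Ludwig release.
# In Ludwig itself a registry dict already exists; here we stand in for it.
tokenizer_registry = {}


class ChineseTokenizer:
    """Wraps a spaCy Chinese model (assumed linked under the name 'zh')."""

    def __init__(self):
        self.nlp = None  # loaded lazily on first call

    def __call__(self, text):
        if self.nlp is None:
            import spacy  # requires the model to be linked beforehand
            self.nlp = spacy.load("zh")
        return [token.text for token in self.nlp(text)]


# Register under the key used in the model configuration:
tokenizer_registry["chinese_tokenizer"] = ChineseTokenizer
```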
After you do this, you can refer to `chinese_tokenizer` in your model configuration within the same script where you run the code above.

Okay, thanks @ANarayan, this makes sense. @w4nderlust, I have created PR #1012 for this fix and have tested it with my fork.
Also, FYI, I am working on an AWS SageMaker example that I should be able to share with the community shortly.