Custom Multilingual Tokenizer
I started working on adding support for the Arabic language in #2314.
A large number of NLP tasks require normalizing the text. For Arabic content, this includes:
- Removing all forms of punctuation and diacritics.
- Dealing with some inconsistent variations. The shapes of Arabic letters vary according to their position in the word, their width, and their number. Some letters can be written interchangeably in different forms, for example alif, alif maqsurah and the regular dotted yaa, or taa marbutah and haa.
To reduce noise and data sparsity when training models, I was thinking of writing some normalization functions that provide different levels of orthographic normalization. I'd like to know how spaCy handles text normalization for different languages, and what the ideal way to include these functions would be. Some of these functions could be useful for other languages as well (e.g. Persian, …).
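For concreteness, here is a rough sketch of the kind of normalization helpers I have in mind. The character mappings, level scheme and function names below are illustrative assumptions on my side, not a definitive list:

```python
import re

# Harakat, tatweel and the dagger alif -- an assumed set, not exhaustive
DIACRITICS_RE = re.compile(u"[\u0640\u064b-\u0652\u0670]")

def remove_diacritics(text):
    """Strip short vowels, tanween, shadda, sukun and tatweel."""
    return DIACRITICS_RE.sub('', text)

def normalize_letter_variants(text):
    """Collapse common orthographic variants (illustrative mapping)."""
    text = re.sub(u"[\u0622\u0623\u0625]", u"\u0627", text)  # آ/أ/إ -> bare alif ا
    text = text.replace(u"\u0649", u"\u064a")                # alif maqsurah ى -> dotted yaa ي
    text = text.replace(u"\u0629", u"\u0647")                # taa marbutah ة -> haa ه
    return text

def normalize(text, level=1):
    """Apply different levels of orthographic normalization."""
    text = remove_diacritics(text)
    if level > 1:
        text = normalize_letter_variants(text)
    return text
```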
I was thinking about two choices: (1) code these and add them as exceptions under lang/ar/, or (2) leverage the new custom pipeline components. To better illustrate some of the use cases, I have coded one of these functionalities, remove_diacritics, as an extension:
```python
import re

from spacy.lang.ar import Arabic
from spacy.tokens import Doc, Token

nlp = Arabic()
tokens = nlp(u'رَمَضَانُ كْرِيمٌ')

# Harakat, tatweel and the dagger alif
all_diacritics = u"[\u0640\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670]"
remove_diacritics = lambda token: re.sub(all_diacritics, '', token.text)

# Expose the diacritics-free form as a custom attribute on both Token and Doc
Token.set_extension('without_diacritics', getter=remove_diacritics)
Doc.set_extension('without_diacritics', getter=remove_diacritics)

print([(token.text, token._.without_diacritics) for token in tokens])
assert tokens[0]._.without_diacritics == u"رمضان"
assert tokens[1]._.without_diacritics == u"كريم"
```
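If option (2) turns out to be the better fit, the same idea could also be written as a custom pipeline component. A rough sketch, assuming the spaCy v2 API where `nlp.add_pipe` accepts any callable that takes and returns a `Doc` (the `norm_ar` attribute and component name are just placeholders of mine):

```python
import re

from spacy.lang.ar import Arabic
from spacy.tokens import Token

# Same diacritics set as above, written as a range
ALL_DIACRITICS = u"[\u0640\u064b-\u0652\u0670]"

def arabic_normalizer(doc):
    """Pipeline component: store a diacritics-free form on each token."""
    for token in doc:
        token._.set('norm_ar', re.sub(ALL_DIACRITICS, '', token.text))
    return doc

# 'norm_ar' is a hypothetical attribute name used only for this sketch
Token.set_extension('norm_ar', default=None)

nlp = Arabic()
nlp.add_pipe(arabic_normalizer, name='arabic_normalizer', first=True)

doc = nlp(u'رَمَضَانُ كْرِيمٌ')
print([(t.text, t._.norm_ar) for t in doc])
```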
Hi @khaledJabr,
I think you were initialising a tokenizer using only the nlp object's vocab, `nlp = Tokenizer(nlp.vocab)`, so the Arabic tokenization rules were not being applied. To apply them, load the Arabic language class with `nlp = Arabic()` and then simply process the text by calling `nlp` on it. You can also filter out stop words and punctuation by checking the `token.is_punct` and `token.is_stop` attributes.
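For example, a minimal sketch along these lines (the sample sentence is arbitrary):

```python
from spacy.lang.ar import Arabic

nlp = Arabic()
doc = nlp(u'رَمَضَانُ كْرِيمٌ!')

# Keep only content tokens: drop punctuation and stop words
content_tokens = [t for t in doc if not t.is_punct and not t.is_stop]
print([t.text for t in content_tokens])
```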
If you want to normalize Arabic content to deal with special cases (removing diacritics, handling some inconsistent variations, etc.), you can check out the Arabic custom tokenizer (spaCy component) that is available as part of Daysam for processing and parsing Arabic text.
Hello Tahar, I have seen the great contribution you made to support the Arabic language in spaCy. I am interested in this, and I would like to know what you are working on now and how I can help.