Custom Multilingual Tokenizer
I started working on adding support for the Arabic language in #2314.
A large number of NLP tasks require normalizing the text. For Arabic content, this includes:
- Removing all forms of punctuation and diacritics.
- Dealing with some inconsistent variations. The shapes of Arabic letters vary according to their position in the word, their width, and their number. Some letters can be written interchangeably in different forms, for example alif, alif maqsurah and the regular dotted yaa, or taa marbutah and haa.
To reduce noise and data sparsity when training models, I was thinking of writing some normalization functions that provide different levels of orthographic normalization. I'd like to know how spaCy handles text normalization for different languages, and what the ideal way to include these functions would be. Some of these functions could be useful for other languages as well (e.g. Persian, …).
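For concreteness, here is a rough sketch of the kind of normalization helpers I have in mind. The character mappings, level scheme and function names below are illustrative assumptions on my side, not a definitive list:

```python
import re

# Harakat, tatweel and the dagger alif -- an assumed set, not exhaustive
DIACRITICS_RE = re.compile(u"[\u0640\u064b-\u0652\u0670]")

def remove_diacritics(text):
    """Strip short vowels, tanween, shadda, sukun and tatweel."""
    return DIACRITICS_RE.sub('', text)

def normalize_letter_variants(text):
    """Collapse common orthographic variants (illustrative mapping)."""
    text = re.sub(u"[\u0622\u0623\u0625]", u"\u0627", text)  # آ/أ/إ -> bare alif ا
    text = text.replace(u"\u0649", u"\u064a")                # alif maqsurah ى -> dotted yaa ي
    text = text.replace(u"\u0629", u"\u0647")                # taa marbutah ة -> haa ه
    return text

def normalize(text, level=1):
    """Apply different levels of orthographic normalization."""
    text = remove_diacritics(text)
    if level > 1:
        text = normalize_letter_variants(text)
    return text
```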
I was thinking about two choices: (1) code these and add them as exceptions under lang/ar/, or (2) leverage the new custom pipeline components. To better illustrate some of the use cases, I have coded one of these functionalities, remove_diacritics, as an extension:
```python
import re

from spacy.lang.ar import Arabic
from spacy.tokens import Doc, Token

nlp = Arabic()
tokens = nlp(u'رَمَضَانُ كْرِيمٌ')

# Harakat, tatweel and the dagger alif
all_diacritics = u"[\u0640\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670]"
remove_diacritics = lambda token: re.sub(all_diacritics, '', token.text)

# Expose the diacritics-free form as a custom attribute on both Token and Doc
Token.set_extension('without_diacritics', getter=remove_diacritics)
Doc.set_extension('without_diacritics', getter=remove_diacritics)

print([(token.text, token._.without_diacritics) for token in tokens])
assert tokens[0]._.without_diacritics == u"رمضان"
assert tokens[1]._.without_diacritics == u"كريم"
```
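If option (2) turns out to be the better fit, the same idea could also be written as a custom pipeline component. A rough sketch, assuming the spaCy v2 API where `nlp.add_pipe` accepts any callable that takes and returns a `Doc` (the `norm_ar` attribute and component name are just placeholders of mine):

```python
import re

from spacy.lang.ar import Arabic
from spacy.tokens import Token

# Same diacritics set as above, written as a range
ALL_DIACRITICS = u"[\u0640\u064b-\u0652\u0670]"

def arabic_normalizer(doc):
    """Pipeline component: store a diacritics-free form on each token."""
    for token in doc:
        token._.set('norm_ar', re.sub(ALL_DIACRITICS, '', token.text))
    return doc

# 'norm_ar' is a hypothetical attribute name used only for this sketch
Token.set_extension('norm_ar', default=None)

nlp = Arabic()
nlp.add_pipe(arabic_normalizer, name='arabic_normalizer', first=True)

doc = nlp(u'رَمَضَانُ كْرِيمٌ')
print([(t.text, t._.norm_ar) for t in doc])
```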
Hi @khaledJabr,
I think you were initialising a tokenizer using only the nlp object's vocab, `nlp = Tokenizer(nlp.vocab)`, so the Arabic tokenization rules were not being applied. To apply them, load the Arabic language class with `nlp = Arabic()` and then simply process the text by calling `nlp` on it. You can also filter out stop words and punctuation by checking the `token.is_punct` and `token.is_stop` attributes.
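For example, a minimal sketch along these lines (the sample sentence is arbitrary):

```python
from spacy.lang.ar import Arabic

nlp = Arabic()
doc = nlp(u'رَمَضَانُ كْرِيمٌ!')

# Keep only content tokens: drop punctuation and stop words
content_tokens = [t for t in doc if not t.is_punct and not t.is_stop]
print([t.text for t in content_tokens])
```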
If you want to normalize Arabic content to deal with special cases (removing diacritics, handling some inconsistent variations, etc.), you can check out the Arabic custom tokenizer (spaCy component) that is available as part of Daysam for processing and parsing Arabic text.
Hello Tahar, I have seen the great contribution you made to support the Arabic language in spaCy. I am interested in this, and I would like to know what you are working on now and how I can help.