Tokenization of punctuation in Hebrew and other non-Latin languages
When tokenizing Hebrew, the full stop at the end of a sentence is not split off, while if the sentence ends with a question mark, an exclamation mark, or an ellipsis, those marks are tokenized correctly.
Example:
from spacy.he import Hebrew
tokenizer = Hebrew().tokenizer
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה.')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה.']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה?')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '?']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה!')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '!']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה..')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '..']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה...')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '...']
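The asymmetry above can be reproduced without spaCy. The snippet below is a minimal illustration, not spaCy's actual tokenizer rules: a suffix pattern whose lookbehind only lists Latin letters never fires after a Hebrew word, so the trailing full stop stays glued to the token, while a lookbehind built from Unicode word characters matches any script.

```python
import re

# Hypothetical simplification of the behaviour described above: a suffix
# rule that spells out Latin letters misses Hebrew, so '.' is not split off.
latin_only_suffix = re.compile(r'(?<=[a-zA-Z])\.$')
# [^\W\d_] is a common idiom for "any Unicode letter" in stdlib re.
any_letter_suffix = re.compile(r'(?<=[^\W\d_])\.$')

print(bool(latin_only_suffix.search('kingdom.')))  # matches: '.' would be split off
print(bool(latin_only_suffix.search('הממלכה.')))   # no match: '.' stays attached
print(bool(any_letter_suffix.search('הממלכה.')))   # matches: letter-class fix works
```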
Info about spaCy
- spaCy version: 1.8.0
- Platform: Linux-3.13.0-74-generic-x86_64-with-debian-jessie-sid
- Python version: 3.6.1
- Installed models:
Issue Analytics
- State:
- Created 6 years ago
- Comments: 7 (5 by maintainers)
Top GitHub Comments
I'll take a shot at fixing it.
Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.
But now that we're adding more and more languages, this keeps coming up, so we should fix it. (If I remember correctly, this was already causing problems for people working with Bengali and developing the Russian integration.)
~~I'll open a separate issue about this for spaCy v2.0, but in short:~~ Never mind, just making this the master issue. Steps are:
- Use the `regex` library to handle compiling the correct character classes
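As a hedged sketch of that direction: the third-party `regex` library supports Unicode property classes like `\p{L}` directly, while the standard-library `re` needs explicit code-point ranges. The ranges and the `split_suffix` helper below are illustrative assumptions, not spaCy's actual implementation.

```python
import re

# Illustrative only: a letter class covering Latin plus the scripts mentioned
# above (Hebrew U+0590-05FF, Cyrillic U+0400-04FF, Bengali U+0980-09FF).
# The regex library's \p{L} would cover every script without hand-listing.
ALPHA = r'a-zA-Z\u0590-\u05FF\u0400-\u04FF\u0980-\u09FF'
suffix_period = re.compile(r'(?<=[' + ALPHA + r'])\.$')

def split_suffix(token):
    """Split a trailing period off a token when it follows a letter."""
    if suffix_period.search(token):
        return [token[:-1], '.']
    return [token]

print(split_suffix('הממלכה.'))  # ['הממלכה', '.'] -- period split off
print(split_suffix('3.'))       # ['3.'] -- digits keep their period
```

Deriving the class from Unicode categories rather than hand-listed characters is the point of the proposed `regex` switch: new languages then work without touching the punctuation rules.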