Tokenization of punctuation in Hebrew and other non-Latin languages

When tokenizing Hebrew, the full stop at the end of a sentence is not split off into its own token, whereas a sentence-final question mark, exclamation mark, or ellipsis is.

Example:

from spacy.he import Hebrew

tokenizer = Hebrew().tokenizer

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה.')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה.']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה?')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה!')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה..')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']

print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי המדינה...')))
#  ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...']

Info about spaCy

  • spaCy version: 1.8.0
  • Platform: Linux-3.13.0-74-generic-x86_64-with-debian-jessie-sid
  • Python version: 3.6.1
  • Installed models:

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

2 reactions
beneyal commented, Apr 19, 2017

I'll take a shot at fixing it.

1 reaction
ines commented, Apr 19, 2017

Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.

But now that we're adding more and more languages, this keeps coming up, so we should fix this. (If I remember correctly, this was already causing problems for people working with Bengali and developing Russian integration.)

~I'll open a separate issue about this for spaCy v2.0, but in short:~ Never mind, just making this the master issue. Steps are:

  • get rid of explicit character list
  • use regex library to handle compiling the correct character classes
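
The difference between an explicit Latin character list and a script-agnostic character class can be sketched roughly as follows. This is only an illustration, not spaCy's actual punctuation rules: the names `LATIN_SUFFIX`, `UNICODE_SUFFIX`, `split_trailing_period`, and `tokenize` are hypothetical, and the sketch uses the standard-library `re` module (where `[^\W\d_]` matches any Unicode letter) in place of the third-party `regex` library mentioned above.

```python
import re

# Hypothetical Latin-only suffix rule: a final "." is split off only when
# it follows an ASCII letter (the kind of explicit list described above).
LATIN_SUFFIX = re.compile(r"(?<=[a-zA-Z])\.$")

# Hypothetical script-agnostic rule: [^\W\d_] matches any Unicode letter
# in Python 3's re module, so Hebrew, Bengali, Cyrillic, etc. are covered.
UNICODE_SUFFIX = re.compile(r"(?<=[^\W\d_])\.$")


def split_trailing_period(chunk, suffix_re):
    """Split a final full stop off one whitespace-delimited chunk."""
    match = suffix_re.search(chunk)
    if match is None:
        return [chunk]
    return [chunk[:match.start()], chunk[match.start():]]


def tokenize(text, suffix_re):
    """Whitespace-split the text, then apply the suffix rule to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_trailing_period(chunk, suffix_re))
    return tokens


if __name__ == "__main__":
    sentence = "בכל רחבי המדינה."
    print(tokenize(sentence, LATIN_SUFFIX))    # full stop stays attached
    print(tokenize(sentence, UNICODE_SUFFIX))  # full stop becomes its own token
```

With the Latin-only rule, the lookbehind never matches after a Hebrew letter, which would reproduce the behaviour reported in this issue. With the `regex` library, the same idea could presumably be expressed through Unicode property classes such as `\p{L}`.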