Tokenization of punctuation in Hebrew and other non-Latin languages
When tokenizing Hebrew, the full stop at the end of a sentence is not split off, while if the sentence ends with a question mark, an exclamation mark, or an ellipsis, those marks are tokenized correctly.
Example:
from spacy.he import Hebrew
tokenizer = Hebrew().tokenizer
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה.')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה.']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה?')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '?']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה!')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '!']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה..')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '..']
print(list(w.text for w in tokenizer('עקבת אחריו בכל רחבי הממלכה...')))
# ['עקבת', 'אחריו', 'בכל', 'רחבי', 'הממלכה', '...']
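The asymmetry above can be reproduced without spaCy. The snippet below is a minimal illustration, not spaCy's actual tokenizer rules: a suffix pattern whose lookbehind only lists Latin letters never fires after a Hebrew word, so the trailing full stop stays glued to the token, while a lookbehind built from Unicode word characters matches any script.

```python
import re

# Hypothetical simplification of the behaviour described above: a suffix
# rule that spells out Latin letters misses Hebrew, so '.' is not split off.
latin_only_suffix = re.compile(r'(?<=[a-zA-Z])\.$')
# [^\W\d_] is a common idiom for "any Unicode letter" in stdlib re.
any_letter_suffix = re.compile(r'(?<=[^\W\d_])\.$')

print(bool(latin_only_suffix.search('kingdom.')))  # matches: '.' would be split off
print(bool(latin_only_suffix.search('הממלכה.')))   # no match: '.' stays attached
print(bool(any_letter_suffix.search('הממלכה.')))   # matches: letter-class fix works
```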
Info about spaCy
- spaCy version: 1.8.0
- Platform: Linux-3.13.0-74-generic-x86_64-with-debian-jessie-sid
- Python version: 3.6.1
- Installed models:
Issue Analytics
- State:
- Created 6 years ago
- Comments: 7 (5 by maintainers)
Top GitHub Comments
I'll take a shot at fixing it.
Thanks for the report. I think this is caused by the global regex rules for punctuation, some of which currently only cover Latin characters. We originally chose the approach of spelling out the individual characters because it made it easier to create uppercase/lowercase sets, and kept things a bit more readable while we were tidying up the language data and inviting more people to contribute.
But now that we're adding more and more languages, this keeps coming up, so we should fix it. (If I remember correctly, this was already causing problems for people working with Bengali and developing the Russian integration.)
~~I'll open a separate issue about this for spaCy v2.0, but in short:~~ Never mind, just making this the master issue. Steps are:
- Use the `regex` library to handle compiling the correct character classes
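As a hedged sketch of that direction: the third-party `regex` library supports Unicode property classes like `\p{L}` directly, while the standard-library `re` needs explicit code-point ranges. The ranges and the `split_suffix` helper below are illustrative assumptions, not spaCy's actual implementation.

```python
import re

# Illustrative only: a letter class covering Latin plus the scripts mentioned
# above (Hebrew U+0590-05FF, Cyrillic U+0400-04FF, Bengali U+0980-09FF).
# The regex library's \p{L} would cover every script without hand-listing.
ALPHA = r'a-zA-Z\u0590-\u05FF\u0400-\u04FF\u0980-\u09FF'
suffix_period = re.compile(r'(?<=[' + ALPHA + r'])\.$')

def split_suffix(token):
    """Split a trailing period off a token when it follows a letter."""
    if suffix_period.search(token):
        return [token[:-1], '.']
    return [token]

print(split_suffix('הממלכה.'))  # ['הממלכה', '.'] -- period split off
print(split_suffix('3.'))       # ['3.'] -- digits keep their period
```

Deriving the class from Unicode categories rather than hand-listed characters is the point of the proposed `regex` switch: new languages then work without touching the punctuation rules.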