Indian Tokenizer should recognise deergha virama ("॥") as a SINGLE token.
See original GitHub issueThere are two special punctuation marks in Indian Languages, namely the the purna virama (“|”) and deergha virama (“॥”), for Indian language scripts. While indian_tokenizer.py does a good job in tokenizing, it should recognise the deergha virama (“॥”) as a single token instead of two tokens.
>>> from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex as i_word
#Sanskrit
>>> sentence = "अग्निमीळे पुरोहितं यज्ञस्य देवं रत्वीजम |होतारं रत्नधातमम ||"
>>> sanskrit_text_tokenize = i_word(sentence)
>>> sanskrit_text_tokenize
['अग्निमीळे', 'पुरोहितं', 'यज्ञस्य', 'देवं', 'रत्वीजम', '|', 'होतारं', 'रत्नधातमम', '|', '|']
#Bengali
>>> sentence = "রাজপণ্ডিত হব মনে আশা করে | সপ্তশ্লোক ভেটিলাম রাজা গৌড়েশ্বরে ||">>> bengali_text_tokenize = i_word(sentence)>>> bengali_text_tokenize
['রাজপণ্ডিত', 'হব', 'মনে', 'আশা', 'করে', '|', 'সপ্তশ্লোক', 'ভেটিলাম', 'রাজা', 'গৌড়েশ্বরে', '|', '|']
Issue Analytics
- State:
- Created 7 years ago
- Comments:6 (6 by maintainers)
Top Results From Across the Web
tokenize Package - Indic NLP Library 0.2 documentation
The sentence splitter can identify non-breaking phrases like single letter, common abbreviations/honorofics for some Indian languages. Parameters: text (str) – ...
Read more >Summary of the tokenizers - Hugging Face
As we saw in the preprocessing tutorial, tokenizing a text is splitting it into words or subwords, which then are converted to ids...
Read more >Indic NLP Library - Anoop Kunchukuttan
A trivial tokenizer which just tokenizes on the punctuation boundaries. This also includes punctuations for the Indian language scripts (the purna virama and ......
Read more >NLP Libraries For Indian Languages - Analytics Vidhya
Identify the language of a text. Knowing what language a particular text is written in can be very useful when building vernacular applications ......
Read more >4. Tokenization - Applied Natural Language Processing in the ...
txt file or something else that is read into a Python object. The output is a sequence of tokens. One of the main...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
In Sanskrit especially in poems, paragraphs end with || . I dont know about other languages though.
Well at least in Hindi, the sentence usually ends up with’|'. I mean I have always seen the sentences ended with a ‘|’ but have never seen something ending with ‘||’ unless it is a peculiar form of poem writing.