Indian Tokenizer should recognise deergha virama ("॥") as a SINGLE token.

Indian language scripts have two special punctuation marks: the purna virama (“|”) and the deergha virama (“॥”). While indian_tokenizer.py does a good job of tokenizing, it should recognise the deergha virama (“॥”) as a single token instead of two tokens.

>>> from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex as i_word
# Sanskrit
>>> sentence = "अग्निमीळे पुरोहितं यज्ञस्य देवं रत्वीजम |होतारं रत्नधातमम ||"
>>> sanskrit_text_tokenize = i_word(sentence)
>>> sanskrit_text_tokenize
['अग्निमीळे', 'पुरोहितं', 'यज्ञस्य', 'देवं', 'रत्वीजम', '|', 'होतारं', 'रत्नधातमम', '|', '|']
# Bengali
>>> sentence = "রাজপণ্ডিত হব মনে আশা করে | সপ্তশ্লোক ভেটিলাম রাজা গৌড়েশ্বরে ||"
>>> bengali_text_tokenize = i_word(sentence)
>>> bengali_text_tokenize
['রাজপণ্ডিত', 'হব', 'মনে', 'আশা', 'করে', '|', 'সপ্তশ্লোক', 'ভেটিলাম', 'রাজা', 'গৌড়েশ্বরে', '|', '|']
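
The behaviour requested above could be sketched roughly as follows. This is only an illustration, not the code in indian_tokenizer.py: the helper name tokenize_with_deergha_virama and the pattern are hypothetical, and the sketch handles only whitespace and the two viramas (written either as ASCII pipes or as the Unicode characters "।" and "॥"), not the other punctuation the real tokenizer splits on. The idea is simply to try the double mark before the single one in the regex alternation.

```python
import re

# Hypothetical pattern, not taken from CLTK. Alternatives are tried left to
# right, so the deergha virama ("॥" or two ASCII pipes) is matched before a
# single purna virama ("।" or "|"). Anything else that is not whitespace
# or a virama is grouped into a word token.
_TOKEN_PATTERN = re.compile(r"॥|\|\||।|\||[^\s|।॥]+")


def tokenize_with_deergha_virama(text):
    """Split text into tokens, keeping a deergha virama as one token."""
    return _TOKEN_PATTERN.findall(text)


sentence = "अग्निमीळे पुरोहितं यज्ञस्य देवं रत्वीजम |होतारं रत्नधातमम ||"
tokens = tokenize_with_deergha_virama(sentence)
print(tokens)
```

With this ordering, the final "||" in the Sanskrit sentence comes back as one token, while the single "|" in the middle is still a separate token.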

Issue Analytics

Created 7 years ago
Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction

lazycoder1 commented, Mar 9, 2017

In Sanskrit, especially in poems, paragraphs end with ||. I don't know about other languages, though.

1 reaction

nikheelpandey commented, Mar 9, 2017

Well, at least in Hindi, a sentence usually ends with '|'. I have always seen sentences end with '|' and have never seen one end with '||', unless it is a peculiar form of poetry.
