Indian Tokenizer should recognise deergha virama ("॥") as a SINGLE token.

There are two special punctuation marks for Indian language scripts: the purna virama (“|”) and the deergha virama (“॥”). While indian_tokenizer.py does a good job of tokenizing, it should recognise the deergha virama (“॥”) as a single token instead of splitting it into two tokens.

>>> from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex as i_word
#Sanskrit
>>> sentence = "अग्निमीळे पुरोहितं यज्ञस्य देवं रत्वीजम |होतारं रत्नधातमम ||"
>>> sanskrit_text_tokenize = i_word(sentence)
>>> sanskrit_text_tokenize 
['अग्निमीळे', 'पुरोहितं', 'यज्ञस्य', 'देवं', 'रत्वीजम', '|', 'होतारं', 'रत्नधातमम', '|', '|']
#Bengali
>>> sentence = "রাজপণ্ডিত হব মনে আশা করে | সপ্তশ্লোক ভেটিলাম রাজা গৌড়েশ্বরে ||"
>>> bengali_text_tokenize = i_word(sentence)
>>> bengali_text_tokenize
['রাজপণ্ডিত', 'হব', 'মনে', 'আশা', 'করে', '|', 'সপ্তশ্লোক', 'ভেটিলাম', 'রাজা', 'গৌড়েশ্বরে', '|', '|']
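
One possible way to get the requested behaviour is to try the double mark before any single punctuation character when matching. The sketch below is only an illustration under that assumption: the pattern and the function name `tokenize_with_double_danda` are hypothetical and are not taken from CLTK's indian_tokenizer.py, and the punctuation set is deliberately small.

```python
import re

# Hypothetical pattern (not CLTK's): the alternatives "||" and "॥" come
# first, so a doubled mark is matched as one unit before the single-character
# punctuation class gets a chance to split it.
_PUNCT = re.compile(r"(\|\||॥|[|।.,;:!?()\[\]-])")


def tokenize_with_double_danda(text):
    """Split on whitespace and punctuation, keeping "||" and "॥" whole."""
    # Surround every punctuation match with spaces, then split on whitespace.
    spaced = _PUNCT.sub(r" \1 ", text)
    return spaced.split()


print(tokenize_with_double_danda("देवं |होतारं ||"))
# ['देवं', '|', 'होतारं', '||']
```

Because the regex engine tries alternatives left to right, ordering the two-character forms first is what keeps them together; a single "|" still becomes its own token.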

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
lazycoder1 commented, Mar 9, 2017

In Sanskrit, especially in poems, paragraphs end with "||". I don't know about other languages, though.

1 reaction
nikheelpandey commented, Mar 9, 2017

Well, at least in Hindi, a sentence usually ends with '|'. I have always seen sentences end with '|' and have never seen one end with '||', unless it is a peculiar form of poem writing.
