Remove Diacritics for Urdu Language


I would like to contribute Urdu language support. Let me start with a simple issue.

For Urdu text with diacritics, text = "اِس, اُس"

the following code produces incorrect output:

import pandas as pd
import texthero as hero

# Urdu text containing the diacritics zer and pesh
text = "اِس, اُس"
s = pd.Series(text)
s1 = hero.remove_diacritics(s)
s1
# Output: is, us

This produces the output is, us, which is a transliteration and not the intended result.

The intended output is اس, اس

This can probably be achieved by removing the following diacritic characters:

Urdu diacritics:
zabar = u'\u064e'
pesh = u'\u064f'
zer = u'\u0650'
tashdid = u'\u0651'
jazam = u'\u0652'
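
The replacement suggested above can be sketched in plain Python. This is only an illustration, not texthero's implementation: the function name remove_urdu_diacritics is hypothetical, and it assumes the five marks listed in the issue are the only ones that need to be removed.

```python
import re

# The five marks listed in the issue (assumed to be sufficient for this example).
URDU_DIACRITICS = {
    "zabar": "\u064e",
    "pesh": "\u064f",
    "zer": "\u0650",
    "tashdid": "\u0651",
    "jazam": "\u0652",
}

# Character class matching any one of the marks above.
_DIACRITICS_PATTERN = re.compile("[" + "".join(URDU_DIACRITICS.values()) + "]")


def remove_urdu_diacritics(text: str) -> str:
    """Delete the listed marks and leave every other character untouched.

    Hypothetical helper for illustration; it is not part of texthero.
    """
    return _DIACRITICS_PATTERN.sub("", text)


if __name__ == "__main__":
    # The example text from the issue.
    print(remove_urdu_diacritics("اِس, اُس"))
```

To use it on a pandas Series like the one in the issue, something like s.apply(remove_urdu_diacritics) should work, since the function operates on a single string.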

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Reactions:1
  • Comments:7 (2 by maintainers)

Top GitHub Comments

henrifroese commented, Jul 13, 2020 (1 reaction)

I believe the issue is that the proposed changes are on the Fix-Remove-Diacritics branch of the repository. With your command, pip uses the master branch where the changes are not yet implemented. You can try using !pip install git+https://github.com/SummerOfCode-NoHate/texthero.git@Fix_Remove_Diacritics, I think that should work. If not, you could just copy the functions I pasted in #72 and try them out directly.

cmhashim commented, Jul 13, 2020 (0 reactions)

from texthero.ur import hero
hero.remove_diacritics(...)

Where this remove_diacritics is specialized in dealing with Urdu text.

I think it's time to do this. It was my mistake to give such a simple example of Urdu text with diacritics; handling diacritics in Urdu/Arabic is much more complex. Some diacritics are an integral part of Urdu words and must be written, while others can be omitted. Could we therefore have an optional argument that takes a list of diacritics to retain or remove?

Some Examples:

retain_diacritics_eg_text = "فوراً, حتیٰ, آزاد, ہوئی"
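
One possible shape for the optional argument requested above is sketched here. The function name remove_diacritics_except and its keep parameter are assumptions made for illustration and are not texthero's API. The sketch drops every Unicode combining mark unless the caller lists it in keep, and it deliberately skips Unicode normalization so that precomposed letters such as آ are not split into a base letter plus a mark.

```python
import unicodedata


def remove_diacritics_except(text: str, keep: frozenset = frozenset()) -> str:
    """Drop combining marks from text, except the characters listed in keep.

    Hypothetical helper for illustration; it is not part of texthero.
    No normalization is applied, so precomposed letters stay intact.
    """
    return "".join(
        ch for ch in text
        # unicodedata.combining returns a non-zero class for combining marks.
        if ch in keep or not unicodedata.combining(ch)
    )


if __name__ == "__main__":
    # Assumption: the caller wants to retain the tanween (U+064B) and the
    # superscript alef (U+0670), which appear in the example words above.
    RETAIN = frozenset({"\u064b", "\u0670"})
    print(remove_diacritics_except("فوراً, حتیٰ, آزاد, ہوئی", keep=RETAIN))
    print(remove_diacritics_except("اِس, اُس", keep=RETAIN))
```

Which marks belong in the retained set is a linguistic decision that the issue leaves open, so the set used here is only an example.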
