Remove Diacritics for Urdu Language
I would like to contribute Urdu language support. Let me start with a simple issue.
For Urdu text with diacritics, such as text = "اِس, اُس",
the following code produces incorrect output:
import pandas as pd
import re
import texthero as hero
text = "اِس, اُس"
s = pd.Series(text)
s1 = hero.remove_diacritics(s)
s1
is, us
This produces the output "is, us", which is a transliteration, not the intended result.
The intended output is اس, اس
This can probably be achieved by removing the following diacritic characters:
Urdu diacritics:
zabar = u'\u064e'
pesh = u'\u064f'
zer = u'\u0650'
tashdid = u'\u0651'
jazam = u'\u0652'
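As a minimal sketch of the replacement idea, the five code points listed above can be removed with a single regular expression, which leaves the Urdu letters untouched instead of transliterating them. The names URDU_DIACRITICS and strip_urdu_diacritics are hypothetical and are not part of texthero.

```python
import re

# Code points as listed in the issue; the dictionary name is hypothetical.
URDU_DIACRITICS = {
    "zabar": "\u064e",
    "pesh": "\u064f",
    "zer": "\u0650",
    "tashdid": "\u0651",
    "jazam": "\u0652",
}

# One character class that matches any of the five marks.
_DIACRITICS_RE = re.compile("[" + "".join(URDU_DIACRITICS.values()) + "]")


def strip_urdu_diacritics(text: str) -> str:
    """Remove the listed marks and leave every other character as it is."""
    return _DIACRITICS_RE.sub("", text)


print(strip_urdu_diacritics("اِس, اُس"))
```

On a pandas Series the same function could be applied with s.apply(strip_urdu_diacritics). Any mark outside the five listed ones is deliberately left in place.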
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:7 (2 by maintainers)
I believe the issue is that the proposed changes are on the Fix-Remove-Diacritics branch of the repository. With your command, pip uses the master branch, where the changes are not yet implemented. You can try using
!pip install git+https://github.com/SummerOfCode-NoHate/texthero.git@Fix_Remove_Diacritics
I think that should work. If not, you could just copy the functions I pasted in #72 and try them out directly.
I think it's time to do this. It was my mistake to give a very simple example of Urdu text with diacritics; handling diacritics in Urdu/Arabic is much more complex. Some diacritics are an integral part of Urdu words and must be written, while others can be omitted. Hence, can we have an optional argument that takes a list of diacritics to retain or remove?
Some examples:
retain_diacritics_eg_text = "فوراً, حتیٰ, آزاد, ہوئی"
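One way the requested optional argument could look is sketched below. It is an assumption about a possible design, not the texthero API: the names strip_all_marks, remove_urdu_diacritics, DEFAULT_REMOVE, remove and keep are all hypothetical. For contrast, the first function shows a blanket approach that decomposes the text and drops every nonspacing mark, which also alters letters such as آ.

```python
import unicodedata

# Hypothetical default: the five marks listed earlier in the issue.
DEFAULT_REMOVE = frozenset("\u064e\u064f\u0650\u0651\u0652")


def strip_all_marks(text: str) -> str:
    """Blanket approach: decompose (NFD), then drop every nonspacing mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def remove_urdu_diacritics(text: str, remove=DEFAULT_REMOVE, keep=()) -> str:
    """Drop only the characters in `remove` that are not listed in `keep`."""
    to_drop = set(remove) - set(keep)
    # NFC keeps precomposed letters such as U+0622 (alef with madda) whole,
    # so only free-standing combining marks can be removed.
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if ch not in to_drop)


retain_diacritics_eg_text = "فوراً, حتیٰ, آزاد, ہوئی"
print(strip_all_marks(retain_diacritics_eg_text))
print(remove_urdu_diacritics(retain_diacritics_eg_text))
print(remove_urdu_diacritics("اِس, اُس", keep={"\u0650"}))
```

The blanket version turns آزاد into ازاد because NFD splits آ into alef plus a combining madda. The selective version only touches the marks it is asked to remove, so which marks belong in the default list remains a decision for someone who knows Urdu orthography.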