Use languages' alphabets to make detection more accurate
Что это за язык? ("What language is this?") is a Russian sentence, but it is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither the Bulgarian nor the Macedonian alphabet contains the letters э and ы.
The same happens with Чекаю цієї хвилини., which is Ukrainian but is detected as Northern Uzbek with probability 1, whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian, whereas the Uzbek Cyrillic alphabet lacks as many as five letters from this sentence, namely ю, ц, і, є and ї.
I know that Franc is not expected to do well on short input strings, but taking alphabets into account seems like a promising way to improve its accuracy.
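To make the proposal concrete, here is a minimal sketch of how an alphabet check could be layered on top of an existing ranking. It is not Franc's implementation and does not call Franc at all: the `ALPHABETS` table and the `filter_by_alphabet` helper are hypothetical names introduced for illustration, and the scores are simply the ones quoted above for the Russian sentence.

```python
# Hypothetical alphabet table (an assumption for this sketch, not part of Franc).
# Keys are ISO 639-3 codes; values are the lowercase letters of each alphabet.
ALPHABETS = {
    "rus": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    "bul": set("абвгдежзийклмнопрстуфхцчшщъьюя"),
    "mkd": set("абвгдѓежзѕијклљмнњопќрстуфхцчџш"),
}


def filter_by_alphabet(text, ranked):
    """Drop candidates whose alphabet lacks a letter that occurs in `text`.

    `ranked` is a list of (language code, score) pairs, best first.
    Languages without an entry in ALPHABETS are kept unchanged.
    If every candidate would be dropped, the original ranking is returned.
    """
    letters = {ch for ch in text.lower() if ch.isalpha()}
    kept = [
        (lang, score)
        for lang, score in ranked
        if lang not in ALPHABETS or letters <= ALPHABETS[lang]
    ]
    return kept or ranked


# Scores as reported in this issue for the Russian sentence.
ranked = [
    ("bul", 1.0),
    ("rus", 0.938953488372093),
    ("mkd", 0.9353197674418605),
]
result = filter_by_alphabet("Что это за язык?", ranked)
print(result)
```

With this table, Bulgarian and Macedonian are discarded because of э and ы, so only Russian remains. A hard filter like this would misfire on text containing foreign names or quotations, so a real implementation might prefer to lower the score of a candidate rather than remove it.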
Issue Analytics
- State:
- Created 4 years ago
- Reactions:5
- Comments:15 (5 by maintainers)

@thorn0 @wooorm I would put a $50 bug bounty on this payable by PayPal if anyone had the time!
@wooorm Yes, ı and İ are specific to Turkish.