Use languages' alphabets to make detection more accurate


"Что это за язык?" is a Russian sentence, but it is detected as Bulgarian (bul 1, rus 0.938953488372093, mkd 0.9353197674418605). However, neither Bulgarian nor Macedonian has the letters э and ы in its alphabet.

The same happens with "Чекаю цієї хвилини.", which is Ukrainian but is detected as Northern Uzbek with probability 1, whereas Ukrainian gets only 0.33999999999999997. However, the letters є and ї are used only in Ukrainian, whereas the Uzbek Cyrillic alphabet lacks as many as five letters from this sentence, namely ю, ц, і, є and ї.

I know that Franc is not expected to do well on short input strings, but taking alphabets into account seems like a promising way to improve accuracy.
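To make the idea concrete, here is a minimal sketch of an alphabet filter applied on top of a detector's ranked output. It is written in standalone Python and does not use Franc's API; the `ALPHABETS` table and the `filter_by_alphabet` helper are hypothetical names introduced only for illustration, and the candidate scores are the ones quoted above. It covers only four languages, and a real implementation would need a table for every supported language plus a policy for loanwords and stray foreign characters.

```python
# Lowercase alphabets keyed by ISO 639-3 code.
# Hypothetical lookup table for illustration; only four languages are covered.
ALPHABETS = {
    "rus": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    "bul": set("абвгдежзийклмнопрстуфхцчшщъьюя"),
    "mkd": set("абвгдѓежзѕијклљмнњопрстќуфхцчџш"),
    "ukr": set("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"),
}


def filter_by_alphabet(text, candidates):
    """Drop candidates whose alphabet lacks a letter that occurs in `text`.

    `candidates` is a list of (language_code, score) pairs, best first.
    Languages without an entry in ALPHABETS are kept unchanged, and if
    every candidate would be dropped the original list is returned.
    """
    letters = {ch for ch in text.lower() if ch.isalpha()}
    kept = []
    for lang, score in candidates:
        alphabet = ALPHABETS.get(lang)
        if alphabet is None or letters <= alphabet:
            kept.append((lang, score))
    return kept or list(candidates)


# Scores as reported in this issue for the Russian sentence.
candidates = [
    ("bul", 1.0),
    ("rus", 0.938953488372093),
    ("mkd", 0.9353197674418605),
]
result = filter_by_alphabet("Что это за язык?", candidates)
print(result)  # only the "rus" entry remains, because э and ы rule out bul and mkd
```

A hard filter like this is the bluntest option. A softer variant could instead scale down the score of each candidate in proportion to the number of out-of-alphabet letters, which would be more forgiving of text that contains foreign names.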

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 5
  • Comments: 15 (5 by maintainers)

Top GitHub Comments

niftylettuce commented, Jun 7, 2020 (3 reactions)

@thorn0 @wooorm I would put a $50 bug bounty on this payable by PayPal if anyone had the time!

muratcorlu commented, Sep 2, 2021 (2 reactions)

"I remember there is a turkish i variant that isn’t used anywhere else as well, forgot what it was tho"

@wooorm Yes, ı and İ are specific to Turkish.

