
Add hero.infer_lang(s)


Add a function hero.infer_lang(s) (suggestions for a better function name are more than welcome!) that, given a Pandas Series, finds the language of each row.

Implementation

We will probably need to define an _infer_lang function inside the outer function that takes a text as input and returns the language of that text. infer_lang will then just apply it to s.
Searching Google for "infer language python" might be a good start.
There are probably two ways to solve this: rule-based and model-based. If the rule-based approach has high accuracy, we can stick with it, as it is probably faster.
Add it under nlp.py.
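
The structure described above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the toy rule-based detector, the stopword lists in `_STOPWORDS`, and the `"unknown"` fallback are hypothetical and not part of Texthero. A real implementation would most likely delegate to a dedicated language-detection library.

```python
import pandas as pd

# Hypothetical, tiny stopword lists used only for illustration.
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
    "es": {"el", "los", "y", "es", "las"},
}


def infer_lang(s: pd.Series) -> pd.Series:
    """Return, for each row of s, the code of the best-matching language."""

    def _infer_lang(text: str) -> str:
        words = set(str(text).lower().split())
        # Count how many known stopwords of each language occur in the text.
        scores = {lang: len(words & stop) for lang, stop in _STOPWORDS.items()}
        best = max(scores, key=scores.get)
        # Fall back to "unknown" when no stopword matched at all.
        return best if scores[best] > 0 else "unknown"

    return s.apply(_infer_lang)
```

Keeping `_infer_lang` nested means the public function only exposes the Series-level API, while the per-text logic can later be swapped for a library call without changing the signature.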

Improvement

A more complex solution would return not only the language but also its probability or, even better, a dictionary like this one (or similar):

{
   'en': 0.8, 
   'fr': 0.1, 
   'es': 0.0
}
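
One possible way to produce such a dictionary is sketched below. It is an assumption-laden toy: the function name `lang_probabilities` and the stopword lists are hypothetical, and the values are normalized stopword counts rather than calibrated probabilities from a trained model.

```python
# Hypothetical stopword lists, for illustration only.
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
    "es": {"el", "los", "y", "es", "las"},
}


def lang_probabilities(text: str) -> dict:
    """Return a {language: score} dict; scores sum to 1 unless nothing matched."""
    words = text.lower().split()
    # Number of tokens that are stopwords of each language.
    hits = {lang: sum(w in stop for w in words) for lang, stop in _STOPWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        # No evidence for any language: report zeros instead of dividing by zero.
        return {lang: 0.0 for lang in hits}
    return {lang: count / total for lang, count in hits.items()}
```

With a model-based library, the same dictionary shape could instead be filled from the model's own probability output.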

Expected PR

Should justify the choice of algorithm or external library
Explain how complex it would be to add the “improvement”
Show a concrete example on a large, multilingual dataset, or at least give evidence that it works well (for example, citing that under the hood the function uses package X, which achieved Y accuracy on …)
All other requirements as stated in CONTRIBUTING.md

Issue Analytics

State:
Created 3 years ago
Comments: 10 (5 by maintainers)

Top GitHub Comments

tmankita commented, Aug 16, 2020

Hi Jonathan,

Good to hear from you. Regarding Dask DataFrame: in our current implementation we use a pandas.Series as input, not a pandas.DataFrame. Do you think a DataFrame is better for our needs? How would it speed up the process?

tmankita commented, Jul 13, 2020

(Edited) Dear Jonathan (@jbesomi), after doing some research, here are my findings:

I found a StackOverflow answer (https://stackoverflow.com/a/47106810) that summarizes the open-source libraries for language inference, so I ran a performance test on all of them.

Dataset: 21,000 sentences from Europarl (http://www.statmt.org/europarl/)
21 languages: English, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish

+----------------+----------------+------------------+
| library        | average time   | accuracy score   |
+================+================+==================+
| spaCy          | 0.01438 sec    | 86.181%          |
+----------------+----------------+------------------+
| langid         | 0.00107 sec    | 86.100%          |
+----------------+----------------+------------------+
| langdetect     | 0.00606 sec    | 86.308%          |
+----------------+----------------+------------------+
| fasttext       | 0.02765 sec    | 84.433%          |
+----------------+----------------+------------------+
| cld3           | 0.00026 sec    | 84.973%          |
+----------------+----------------+------------------+
| guess_language | 0.00079 sec    | 75.481%          |
+----------------+----------------+------------------+
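
The benchmark code behind these numbers is not shown in the comment. As a rough sketch of how an average time and an accuracy score of this kind might be measured, assuming each detector can be wrapped as a function from text to language code (the `benchmark` helper and its signature are hypothetical):

```python
import time
from typing import Callable, Iterable, Tuple


def benchmark(detect: Callable[[str], str],
              samples: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (average seconds per sentence, accuracy) for a detector.

    samples is an iterable of (text, expected_language_code) pairs.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if detect(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return elapsed / len(samples), correct / len(samples)
```

Each library would be passed in through a thin wrapper that normalizes its output to the same language codes as the labels, so that the comparison in the loop is fair across libraries.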

Failed:
TextBlob: server-request based; can’t handle multiple requests in parallel.
polyglot: I couldn’t install this library on my local machine (Mac).
Chardet: failed to detect the language in most of the sentences.

Succeeded, but with an issue:
langdetect: in multiple examples it raised the error “No features in the text.”

Finally, I suggest using langid (needs Python <= 3.6) or spaCy, which have the best performance for our use case. For the improvement, I suggest using fastText or cld3, because they can return the K most probable languages for an example (K is a parameter) together with their probabilities. What do you think about my suggestions?
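
For the top-K idea with fastText, a minimal sketch might look like the following. fastText's Python `predict` method accepts a `k` argument and returns labels prefixed with `__label__` alongside their probabilities. The helper name `to_lang_dict` is hypothetical, and the model file name in the usage comment is an assumption about which pre-trained language-identification model is used.

```python
from typing import Dict, Sequence

LABEL_PREFIX = "__label__"


def to_lang_dict(labels: Sequence[str], probs: Sequence[float]) -> Dict[str, float]:
    """Map fastText-style labels and probabilities to {language: probability}."""
    return {
        # Strip the "__label__" prefix when present, e.g. "__label__en" -> "en".
        (label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label): float(p)
        for label, p in zip(labels, probs)
    }


# Usage with fastText (not executed here; assumes the `fasttext` package and a
# downloaded language-identification model file, for example lid.176.bin):
#
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   labels, probs = model.predict("Bonjour tout le monde", k=3)
#   print(to_lang_dict(labels, probs))
```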

Dataset: sentences.all.zip