Add hero.infer_lang(s)

Add a function hero.infer_lang(s) (suggestions for a better function name are more than welcome!) that, given a Pandas Series, finds the language of each row.

Implementation

  1. We will probably need to define an _infer_lang function inside the outer function that takes a text as input and returns its language. infer_lang will then just apply it to s.
  2. Searching Google for “infer language python” might be a good start.
  3. There are probably two ways to solve this: rule-based and model-based. If the rule-based approach has high accuracy, we can stick with it, as it’s probably faster.
  4. Add it under nlp.py
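
The structure described in step 1 could look roughly like the sketch below. It is only an illustration: the stop-word lists, the "unknown" fallback, and the rule-based scoring are assumptions made up for this example, not a proposal for the final algorithm.

```python
import pandas as pd

# Toy stop-word lists (illustrative assumption); a real rule-based
# detector would need far larger vocabularies and more languages.
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
    "es": {"el", "los", "y", "es", "una"},
}


def infer_lang(s: pd.Series) -> pd.Series:
    """Return, for each row of `s`, a best-guess language code."""

    def _infer_lang(text: str) -> str:
        tokens = set(str(text).lower().split())
        # Count how many stop words of each language occur in the text.
        hits = {lang: len(tokens & words) for lang, words in _STOPWORDS.items()}
        best = max(hits, key=hits.get)
        # "unknown" is a placeholder for rows with no evidence at all.
        return best if hits[best] > 0 else "unknown"

    return s.apply(_infer_lang)


s = pd.Series(["the cat is on the table", "le chat est sur la table", "12345"])
print(infer_lang(s).tolist())  # ['en', 'fr', 'unknown']
```

Swapping the body of _infer_lang for a call into an external library would leave the outer function unchanged, which is the main appeal of this layout.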

Improvement

A more complex solution would return not only the language but also its probability or, even better, a dictionary like this one (or similar):

{
   'en': 0.8, 
   'fr': 0.1, 
   'es': 0.0
}
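
As a rough sketch of how such a dictionary might be produced, the example below normalizes per-language scores so that they sum to 1. The stop-word scoring and the function name lang_probabilities are hypothetical and only stand in for whatever detector is eventually chosen.

```python
from typing import Dict

# Toy stop-word lists (illustrative assumption, not a real language model).
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "les"},
    "es": {"el", "los", "y", "es", "una"},
}


def lang_probabilities(text: str) -> Dict[str, float]:
    """Map each known language code to a score in [0, 1]."""
    tokens = text.lower().split()
    # Raw score: number of tokens that are stop words of the language.
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in _STOPWORDS.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No evidence for any language: report zero everywhere.
        return {lang: 0.0 for lang in counts}
    # Normalize so the scores sum to 1.
    return {lang: count / total for lang, count in counts.items()}


probs = lang_probabilities("the cat and le chien")
print({lang: round(p, 2) for lang, p in probs.items()})
# {'en': 0.67, 'fr': 0.33, 'es': 0.0}
```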

Expected PR

  1. Should motivate the choice of algorithm or external library
  2. Explain how complex it would be to add the “improvement”
  3. Show a concrete example on a large, multilingual dataset, or at least give proof that it works well (for example, citing that under the hood the function uses package X, which achieved Y accuracy on …)
  4. All other requirements as stated in CONTRIBUTING.md

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

tmankita commented, Aug 16, 2020

Hi Jonathan,

Good to hear from you. Regarding Dask.dataframe: our current implementation takes a pandas.Series as input, not a pandas.DataFrame. Do you think a DataFrame is better for our needs? How would it speed up the process?

tmankita commented, Jul 13, 2020

Dear Jonathan (@jbesomi), after doing some research, here are my findings:

I found a StackOverflow answer (https://stackoverflow.com/a/47106810) that summarizes the open-source libraries for language inference, so I ran a performance test on all of them.

Dataset: 21,000 sentences (from http://www.statmt.org/europarl/)
21 languages: English, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish

+----------------+----------------+------------------+
| library        | average time   | accuracy score   |
+================+================+==================+
| spaCy          | 0.01438 sec    | 86.181%          |
+----------------+----------------+------------------+
| langid         | 0.00107 sec    | 86.100%          |
+----------------+----------------+------------------+
| langdetect     | 0.00606 sec    | 86.308%          |
+----------------+----------------+------------------+
| fasttext       | 0.02765 sec    | 84.433%          |
+----------------+----------------+------------------+
| cld3           | 0.00026 sec    | 84.973%          |
+----------------+----------------+------------------+
| guess_language | 0.00079 sec    | 75.481%          |
+----------------+----------------+------------------+
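
The comment does not show how the two columns were measured. One plausible way to obtain an average time per call and an accuracy score is sketched below; the benchmark helper, the dummy detector, and the two sample sentences are hypothetical stand-ins, not the original test script or dataset.

```python
import time
from typing import Callable, Iterable, Tuple


def benchmark(
    detect: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> Tuple[float, float]:
    """Return (average seconds per call, accuracy) for `detect`.

    `samples` yields (text, expected language code) pairs.
    """
    total_time = 0.0
    correct = 0
    n = 0
    for text, expected in samples:
        start = time.perf_counter()
        predicted = detect(text)
        total_time += time.perf_counter() - start
        correct += int(predicted == expected)
        n += 1
    if n == 0:
        raise ValueError("samples must not be empty")
    return total_time / n, correct / n


def dummy_detect(text: str) -> str:
    # Stand-in detector that always answers "en" (assumption for the demo).
    return "en"


samples = [("hello world", "en"), ("bonjour le monde", "fr")]
avg_time, accuracy = benchmark(dummy_detect, samples)
print(f"accuracy: {accuracy:.1%}")  # accuracy: 50.0%
```

Any of the libraries in the table could be plugged in by wrapping its detection call in a function with the same text-in, code-out shape.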

Failed:

  • TextBlob: relies on server requests and can’t handle multiple requests in parallel.
  • polyglot: I couldn’t install this library on my local machine (Mac).
  • Chardet: failed to detect the language in most of the sentences.

Succeeded, but with an issue:

  • LangDetect: on multiple examples, it raised the error “No features in the text.”
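
Errors like the LangDetect one could be contained by wrapping the detector so that a failing row yields a fallback value instead of aborting the whole Series. The sketch below is library-agnostic; safe_detect, its parameters, and the "unknown" default are hypothetical names chosen for this example. With langdetect, the errors argument would presumably be narrowed to that library's own exception class.

```python
from typing import Callable, Tuple, Type


def safe_detect(
    detect: Callable[[str], str],
    text: str,
    default: str = "unknown",
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> str:
    """Call `detect(text)`, returning `default` if it raises one of `errors`."""
    try:
        return detect(text)
    except errors:
        return default


def failing_detect(text: str) -> str:
    # Stand-in that mimics a detector rejecting text without usable features.
    raise ValueError("No features in text.")


print(safe_detect(failing_detect, "12345"))    # unknown
print(safe_detect(lambda t: "en", "hello"))    # en
```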

Finally, I suggest using the LangId (needs Python <= 3.6) or spaCy library, which have the best performance for our use case. As for the improvement, I suggest using the FastText or cld3 library, because they can compute the K most frequent languages of an example (K is passed as a parameter) together with their probabilities. What do you think about my suggestions?
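
To connect this with the dictionary proposed in the issue, here is a small sketch of how top-K output might be turned into a language-to-probability mapping. It assumes fastText-style results, where predict returns a pair of label and probability sequences and labels carry a "__label__" prefix. The helper to_lang_probs and the model path in the comments are hypothetical.

```python
from typing import Dict, Iterable

# Assumed usage with fastText (not executed here; requires the fasttext
# package and a downloaded language-identification model file):
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")  # assumed local path
#   labels, probs = model.predict("some text", k=3)


def to_lang_probs(
    labels: Iterable[str],
    probs: Iterable[float],
    prefix: str = "__label__",
) -> Dict[str, float]:
    """Convert parallel label/probability sequences into {lang: probability}."""
    result = {}
    for label, prob in zip(labels, probs):
        # Strip the label prefix if present, e.g. "__label__en" -> "en".
        lang = label[len(prefix):] if label.startswith(prefix) else label
        result[lang] = float(prob)
    return result


# Hand-written values standing in for real model output.
print(to_lang_probs(("__label__en", "__label__fr"), [0.8, 0.1]))
# {'en': 0.8, 'fr': 0.1}
```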

Dataset: sentences.all.zip
