Add hero.infer_lang(s)
Add a function hero.infer_lang(s) (a suggestion for a better function name is more than welcome!) that, given a Pandas Series, finds the language of each row.
Implementation
- Probably, we will need to define an _infer_lang function inside the mother function that takes a text as input and returns the lang of the text. Then infer_lang will just apply it to s.
- Searching Google for "infer language python" might be a good start.
- There are probably two ways to solve this: rule-based and model-based. If the rule-based approach has high accuracy, we can stick to it, as it's probably faster.
- Add it under nlp.py
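The structure described above can be sketched as follows. This is only a minimal illustration, assuming a toy rule-based detector that counts stop-word overlap; the `_STOPWORDS` lists and the "unknown" fallback are hypothetical and not part of Texthero.

```python
import pandas as pd

# Hypothetical, tiny stop-word lists; a real detector would need far more data.
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
    "es": {"el", "la", "y", "es", "los", "de"},
}


def infer_lang(s: pd.Series) -> pd.Series:
    """Return, for each text in `s`, the code of the best-matching language."""

    def _infer_lang(text: str) -> str:
        tokens = set(str(text).lower().split())
        # Count how many stop words of each language occur in the text.
        scores = {lang: len(tokens & words) for lang, words in _STOPWORDS.items()}
        best = max(scores, key=scores.get)
        # No stop word matched at all: report the language as unknown.
        return best if scores[best] > 0 else "unknown"

    return s.apply(_infer_lang)


s = pd.Series(["the cat is in the garden", "le chat est dans le jardin", "12345"])
langs = infer_lang(s)
```

A model-based backend could replace the body of `_infer_lang` without changing the outer function.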
Improvement
A more complex solution would return not only the lang but also its probability or, even better, a dictionary like this one (or similar):
{
'en': 0.8,
'fr': 0.1,
'es': 0.0
}
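One way such a dictionary could be produced is sketched below, under the same toy assumption of stop-word overlap. The values are normalized match counts, not calibrated probabilities, and `infer_lang_proba` and `_STOPWORDS` are hypothetical names.

```python
from typing import Dict

# Hypothetical, tiny stop-word lists used only for illustration.
_STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
    "es": {"el", "la", "y", "es", "los", "de"},
}


def infer_lang_proba(text: str) -> Dict[str, float]:
    """Map each known language to its share of the matched stop words."""
    tokens = set(text.lower().split())
    counts = {lang: len(tokens & words) for lang, words in _STOPWORDS.items()}
    total = sum(counts.values())
    if total == 0:
        # Nothing matched: every language gets a score of zero.
        return {lang: 0.0 for lang in counts}
    return {lang: count / total for lang, count in counts.items()}


scores = infer_lang_proba("la casa es de los amigos")
```

Here "la" is shared by the French and Spanish lists, so French receives a small share of the score while Spanish dominates.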
Expected PR
- Should motivate the choice of algorithm or external library
- Explain how complex it would be to add the “improvement”
- Show a concrete example on a large, multilingual dataset, or at least give proof that it works well (for example, citing that under the hood the function uses package X, which achieved Y accuracy on …)
- All other requirements as stated in CONTRIBUTING.md
- Removed good first issue label
Issue Analytics
- State:
- Created 3 years ago
- Comments:10 (5 by maintainers)
Hi Jonathan,
Good to hear from you. Regarding Dask.dataframe: our current implementation takes a pandas.Series as input, not a pandas.DataFrame. Do you think a DataFrame is better for our needs? How would it speed up the process?
(Edited) Dear Jonathan, @jbesomi After doing some research, I present to you my findings:
I found a StackOverflow answer (https://stackoverflow.com/a/47106810) that summarizes the open-source libraries for language inference, so I ran a performance test on all of them.
Dataset: (from http://www.statmt.org/europarl/) 21,000 sentences
21 languages: English, Bulgarian, Czech, Danish, German, Greek, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
+----------------+----------------+------------------+
| library        | average time   | accuracy score   |
+================+================+==================+
| spaCy          | 0.01438 sec    | 86.181%          |
+----------------+----------------+------------------+
| langid         | 0.00107 sec    | 86.100%          |
+----------------+----------------+------------------+
| langdetect     | 0.00606 sec    | 86.308%          |
+----------------+----------------+------------------+
| fasttext       | 0.02765 sec    | 84.433%          |
+----------------+----------------+------------------+
| cld3           | 0.00026 sec    | 84.973%          |
+----------------+----------------+------------------+
| guess_language | 0.00079 sec    | 75.481%          |
+----------------+----------------+------------------+
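For anyone who wants to reproduce this kind of comparison, a harness along these lines could measure both columns. It is only a sketch and not the script used for the numbers above; `benchmark` and `always_english` are hypothetical names, and any callable that maps a text to a language code can be plugged in.

```python
import time
from typing import Callable, Iterable, Tuple


def benchmark(
    detect: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> Tuple[float, float]:
    """Return (average seconds per sentence, accuracy) for `detect`.

    `samples` holds (text, expected language code) pairs.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if detect(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return elapsed / len(samples), correct / len(samples)


def always_english(text: str) -> str:
    """Trivial stand-in detector, used only to show the call."""
    return "en"


avg_time, accuracy = benchmark(always_english, [("hello", "en"), ("bonjour", "fr")])
```

The timing includes the comparison with the expected label, which is negligible next to the cost of a real detector.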
Failed:
- TextBlob: relies on server requests and can't handle multiple requests in parallel.
- polyglot: I didn't manage to install this library on my local machine (Mac).
- Chardet: failed to detect the language in most of the sentences.
Succeeded but with an issue:
- LangDetect: for multiple examples, it raised the error “No features in the text.”
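Errors like the LangDetect one could be absorbed by a thin wrapper so that a single bad row does not abort the whole Series. The sketch below is generic: `safe_detect` and `failing_detector` are hypothetical, and a real implementation should catch the library's own exception class rather than `Exception`.

```python
from typing import Callable


def safe_detect(detect: Callable[[str], str], text: str, default: str = "unknown") -> str:
    """Call `detect(text)` and fall back to `default` if it raises."""
    try:
        return detect(text)
    except Exception:
        # Assumption: any failure means "language could not be determined".
        # Narrow this to the detector's specific exception in real code.
        return default


def failing_detector(text: str) -> str:
    """Stand-in that fails on texts without letters, to show the fallback."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "en"


results = [safe_detect(failing_detector, t) for t in ["hello world", "12345"]]
```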
Finally, I suggest using the LangId (needs Python =<3.6) or spaCy library, which have the best performance for our uses. Regarding the improvement, I suggest using the FastText or cld3 library, because they can compute the K most likely languages of an example (taking K as a parameter) together with their probabilities. What do you think about my suggestions?
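Whichever backend supplies the scores, selecting the K best entries is a small step. The sketch below assumes the scores arrive as a plain dict of language code to probability, as in the issue description; `top_k_langs` is a hypothetical helper and does not reflect the FastText or cld3 API.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_langs(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the `k` highest-scoring (language, probability) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


best_two = top_k_langs({"en": 0.8, "fr": 0.1, "es": 0.0}, k=2)
```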
Dataset: sentences.all.zip