Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[Performance and Memory Analysis for Large Dataset] very slow for large numbers of Hits

See original GitHub issue

I am trying to run language detection using this script:

    import com.github.pemistahl.lingua.api.Language;
    import com.github.pemistahl.lingua.api.LanguageDetector;
    import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
    import static com.github.pemistahl.lingua.api.Language.*;

    // Build a detector for the 15 candidate languages.
    final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH, JAPANESE, CHINESE, ITALIAN, PORTUGUESE, ARABIC, RUSSIAN, DUTCH, KOREAN, SWEDISH, HINDI, POLISH).build();

    // Time a single detection call on a German sample text.
    long start = System.currentTimeMillis();
    final Language detectedLanguage = detector.detectLanguageOf("Zum Vergleich kann es auch nützlich sein, diese Rankings neben einigen etwas älteren Forschungsergebnissen zu sehen. Im Jahr 2013, Common Sense Advisory zur Verfügung gestellt , eine empirische Studie basiert auf einer Wallet World Online (WOW) - definiert als ‚die gesamte wirtschaftliche Chance, sowohl online als auch offline, berechnet durch einen Anteil eines Landes BIP zu allen wichtigen Blöcken dieser Gesellschaft assoziieren. ' Hier ist, was uns ihre Studie gezeigt hat.");
    // System.out.println(detectedLanguage);
    long end = System.currentTimeMillis();
    System.out.println("Time: " + (end - start));

It’s taking 700 milliseconds, which is very slow and cannot be used for 10,000+ files… Is there any approach to get results within 1-10 milliseconds?
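For context, a minimal timing sketch (an editorial addition, not part of the original question): Lingua appears to load its language models lazily, so the first detectLanguageOf call on a fresh detector can include a one-time loading cost, while later calls on the same instance are much cheaper. The lazy-loading behaviour and the class name below are assumptions for illustration; only LanguageDetectorBuilder.fromLanguages and detectLanguageOf come from the code above.

    import com.github.pemistahl.lingua.api.Language;
    import com.github.pemistahl.lingua.api.LanguageDetector;
    import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
    import static com.github.pemistahl.lingua.api.Language.*;

    public final class DetectorWarmupTiming {
        public static void main(String[] args) {
            // Build the detector once and reuse it for every file;
            // rebuilding it per file repeats the expensive setup.
            final LanguageDetector detector =
                    LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build();

            final String text = "This is a short English sentence used only for timing.";

            long t0 = System.currentTimeMillis();
            Language first = detector.detectLanguageOf(text);   // may include one-time model loading
            long t1 = System.currentTimeMillis();
            Language second = detector.detectLanguageOf(text);  // steady-state, per-text cost
            long t2 = System.currentTimeMillis();

            System.out.println("first call:  " + (t1 - t0) + " ms, result " + first);
            System.out.println("second call: " + (t2 - t1) + " ms, result " + second);
        }
    }

If the second call is fast enough, building the detector once and reusing it across all 10,000+ files may already close most of the gap; restricting the candidate language set should reduce the remaining per-call work further, since fewer models have to be consulted.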

Or is there any function like isEnglish() which would return true only for English?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
the-black-knight-01 commented, Jul 7, 2021

Thanks, it got reduced from 600 milliseconds to 70 milliseconds.

0 reactions
Marcono1234 commented, Jul 17, 2021

If I recall correctly, the Korean, Chinese and Japanese language models are quite large. So if you know beforehand that your input is in none of those languages, you can save quite a lot of memory by excluding them.
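As a concrete illustration of that suggestion (a minimal sketch, not code from the comment): the same builder call as in the question, with CHINESE, JAPANESE and KOREAN dropped so that their models never need to be loaded.

    // Sketch: exclude the large CJK models if the input cannot be in those languages.
    final LanguageDetector detector = LanguageDetectorBuilder
            .fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH, ITALIAN, PORTUGUESE,
                    ARABIC, RUSSIAN, DUTCH, SWEDISH, HINDI, POLISH)
            .build();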

Or is there any function like isEnglish() which would return true only for English?

On the other hand, if your input text is in a language which you have not included, or which Lingua does not support, and which is similar to English, Lingua could erroneously claim that the text is in English.
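Lingua does not ship an isEnglish() helper as far as this issue shows, but a hedged sketch of one way to approximate it with the existing API (the helper name and the wrapping method are illustrative assumptions) is to compare the detection result against Language.ENGLISH, bearing the caveat above in mind:

    // Hypothetical helper, not part of the Lingua API.
    // Returns false for Language.UNKNOWN (unreliable detection) and for any
    // non-English result; see the caveat above about similar, non-included languages.
    static boolean isProbablyEnglish(LanguageDetector detector, String text) {
        return detector.detectLanguageOf(text) == Language.ENGLISH;
    }

If I remember the builder correctly, it needs at least two candidate languages, so the detector passed in would be built from English plus whichever other languages the input might realistically contain.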

Read more comments on GitHub >

Top Results From Across the Web

  • What to Do When Your Data Is Too Big for Your Memory?
    Another way to handle large datasets is by chunking them. That is cutting a large dataset into smaller chunks and then processing those...
  • Are You Still Using Pandas to Process Big Data in 2021? Here ...
    This intrigued me to do a practical experiment with Dask and Vaex and try to process a bigger than memory dataset. The dataset...
  • Memory Management for Large Data Sets - NI
    To do so, break large data sets into smaller sets when transporting data from one place to another - a strategy known as...
  • Handling large data sets in R - AWS
    The Problem with large data sets in R: R reads entire data set into RAM all at once. Other programs can read...
  • SolrPerformanceProblems - Solr - Apache Software Foundation
    Even if the number of actual hits are very low, the fact that the client requests a huge number of rows will cause...
