Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[Performance and Memory Analysis for Large Dataset] very slow for large numbers of Hits

See original GitHub issue

I am trying to run language detection using this script:

    import com.github.pemistahl.lingua.api.Language;
    import com.github.pemistahl.lingua.api.LanguageDetector;
    import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
    import static com.github.pemistahl.lingua.api.Language.*;

    // Build a detector for the 15 candidate languages.
    final LanguageDetector detector = LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH, JAPANESE, CHINESE, ITALIAN, PORTUGUESE, ARABIC, RUSSIAN, DUTCH, KOREAN, SWEDISH, HINDI, POLISH).build();

    // Time a single detection call on a German sample text.
    long start = System.currentTimeMillis();
    final Language detectedLanguage = detector.detectLanguageOf("Zum Vergleich kann es auch nützlich sein, diese Rankings neben einigen etwas älteren Forschungsergebnissen zu sehen. Im Jahr 2013, Common Sense Advisory zur Verfügung gestellt , eine empirische Studie basiert auf einer Wallet World Online (WOW) - definiert als ‚die gesamte wirtschaftliche Chance, sowohl online als auch offline, berechnet durch einen Anteil eines Landes BIP zu allen wichtigen Blöcken dieser Gesellschaft assoziieren. ' Hier ist, was uns ihre Studie gezeigt hat.");
    // System.out.println(detectedLanguage);
    long end = System.currentTimeMillis();
    System.out.println("Time: " + (end - start));

It’s taking 700 milliseconds, which is very slow and cannot be used for 10,000+ files… Is there any approach to get results within 1-10 milliseconds?
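For context, a minimal timing sketch (an editorial addition, not part of the original question): Lingua appears to load its language models lazily, so the first detectLanguageOf call on a fresh detector can include a one-time loading cost, while later calls on the same instance are much cheaper. The lazy-loading behaviour and the class name below are assumptions for illustration; only LanguageDetectorBuilder.fromLanguages and detectLanguageOf come from the code above.

    import com.github.pemistahl.lingua.api.Language;
    import com.github.pemistahl.lingua.api.LanguageDetector;
    import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
    import static com.github.pemistahl.lingua.api.Language.*;

    public final class DetectorWarmupTiming {
        public static void main(String[] args) {
            // Build the detector once and reuse it for every file;
            // rebuilding it per file repeats the expensive setup.
            final LanguageDetector detector =
                    LanguageDetectorBuilder.fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH).build();

            final String text = "This is a short English sentence used only for timing.";

            long t0 = System.currentTimeMillis();
            Language first = detector.detectLanguageOf(text);   // may include one-time model loading
            long t1 = System.currentTimeMillis();
            Language second = detector.detectLanguageOf(text);  // steady-state, per-text cost
            long t2 = System.currentTimeMillis();

            System.out.println("first call:  " + (t1 - t0) + " ms, result " + first);
            System.out.println("second call: " + (t2 - t1) + " ms, result " + second);
        }
    }

If the second call is fast enough, building the detector once and reusing it across all 10,000+ files may already close most of the gap; restricting the candidate language set should reduce the remaining per-call work further, since fewer models have to be consulted.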

Or is there any function like isEnglish() which would return true only for English?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
the-black-knight-01 commented, Jul 7, 2021

Thanks, it got reduced from 600 milliseconds to 70 milliseconds.

0 reactions
Marcono1234 commented, Jul 17, 2021

If I recall correctly, the Korean, Chinese and Japanese language models are quite large. So if you know beforehand that your input is in none of those languages, you can save quite a lot of memory by excluding them.
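As a concrete illustration of that suggestion (a minimal sketch, not code from the comment): the same builder call as in the question, with CHINESE, JAPANESE and KOREAN dropped so that their models never need to be loaded.

    // Sketch: exclude the large CJK models if the input cannot be in those languages.
    final LanguageDetector detector = LanguageDetectorBuilder
            .fromLanguages(ENGLISH, FRENCH, GERMAN, SPANISH, ITALIAN, PORTUGUESE,
                    ARABIC, RUSSIAN, DUTCH, SWEDISH, HINDI, POLISH)
            .build();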

Or is there any function like isEnglish() which would return true only for English?

On the other hand, if your input text is in a language which you have not included, or which Lingua does not support, and which is similar to English, Lingua could erroneously claim that the text is in English.
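Lingua does not ship an isEnglish() helper as far as this issue shows, but a hedged sketch of one way to approximate it with the existing API (the helper name and the wrapping method are illustrative assumptions) is to compare the detection result against Language.ENGLISH, bearing the caveat above in mind:

    // Hypothetical helper, not part of the Lingua API.
    // Returns false for Language.UNKNOWN (unreliable detection) and for any
    // non-English result; see the caveat above about similar, non-included languages.
    static boolean isProbablyEnglish(LanguageDetector detector, String text) {
        return detector.detectLanguageOf(text) == Language.ENGLISH;
    }

If I remember the builder correctly, it needs at least two candidate languages, so the detector passed in would be built from English plus whichever other languages the input might realistically contain.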

Read more comments on GitHub >

Top Results From Across the Web

  • What to Do When Your Data Is Too Big for Your Memory?
    Another way to handle large datasets is by chunking them. That is cutting a large dataset into smaller chunks and then processing those...
  • Are You Still Using Pandas to Process Big Data in 2021? Here ...
    This intrigued me to do a practical experiment with Dask and Vaex and try to process a bigger than memory dataset. The dataset...
  • Memory Management for Large Data Sets - NI
    To do so, break large data sets into smaller sets when transporting data from one place to another - a strategy known as...
  • Handling large data sets in R - AWS
    The Problem with large data sets in R: R reads entire data set into RAM all at once. Other programs can read...
  • SolrPerformanceProblems - Solr - Apache Software Foundation
    Even if the number of actual hits are very low, the fact that the client requests a huge number of rows will cause...
