# Improve performance and reduce memory consumption
As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua. In this issue I will try to highlight the main areas where performance can be improved; some of this is already covered by #98. Note that some of the proposed changes might decrease execution speed or require some larger refactoring.
## Model files
- Instead of storing the model data in JSON format, a binary format matching the in-memory format could be used (see "In-memory models" section). This would have the advantages that:
  - Lookup maps such as `Char2DoubleOpenHashMap` could be created with the expected size, avoiding rehashing of the maps during deserialization.
  - Model file loading is faster.
  - Model file sizes will be slightly smaller when encoding the frequency only once, followed by the number of ngrams which share this frequency, followed by the ngram values.

  Note that even though the fastutil maps are `Serializable`, using JDK serialization might introduce unnecessary overhead and would make this library dependent on the internal serialization format of the fastutil maps. Instead, the data could be written manually to a `DataOutputStream`; a sketch of such a layout is shown below.
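A minimal sketch of what this could look like, assuming a layout where the total ngram count comes first (so the lookup map can be pre-sized), followed by one block per frequency group; all names here are hypothetical:

```kotlin
import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap
import java.io.DataInputStream
import java.io.DataOutputStream

// Hypothetical layout: total ngram count, number of frequency groups,
// then per group: frequency (once), group size, ngram values.
fun writeModel(out: DataOutputStream, ngramsByFrequency: Map<Double, List<String>>) {
    out.writeInt(ngramsByFrequency.values.sumOf { it.size }) // total ngram count
    out.writeInt(ngramsByFrequency.size)                     // number of frequency groups
    for ((frequency, ngrams) in ngramsByFrequency) {
        out.writeDouble(frequency) // frequency is written only once per group
        out.writeInt(ngrams.size)  // number of ngrams sharing this frequency
        for (ngram in ngrams) out.writeUTF(ngram)
    }
}

fun readModel(input: DataInputStream): Object2DoubleOpenHashMap<String> {
    val totalNgrams = input.readInt()
    // Pre-sizing with the expected element count avoids rehashing while loading
    val map = Object2DoubleOpenHashMap<String>(totalNgrams)
    repeat(input.readInt()) {
        val frequency = input.readDouble()
        repeat(input.readInt()) { map.put(input.readUTF(), frequency) }
    }
    return map
}
```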
## Model file loading
- Use a streaming JSON library. The currently used `kotlinx-serialization-json` does not seem to support streaming yet. Therefore, currently the complete model files are loaded as a `String` before being parsed. This is (likely) slow and requires large amounts of memory. Instead, a streaming JSON library such as https://github.com/square/moshi should be used; see the sketch below. Note that this point becomes obsolete if a binary format (as described in the "Model files" section above) is used.
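A minimal sketch of such streaming parsing with Moshi's `JsonReader`; the `"ngrams"` object layout assumed here is for illustration only and is not necessarily Lingua's actual JSON structure:

```kotlin
import com.squareup.moshi.JsonReader
import okio.buffer
import okio.source
import java.io.InputStream

// Streams the model file token by token instead of materializing
// the whole file as a String first.
fun readNgrams(inputStream: InputStream): Map<String, String> {
    val ngrams = mutableMapOf<String, String>()
    JsonReader.of(inputStream.source().buffer()).use { reader ->
        reader.beginObject()
        while (reader.hasNext()) {
            when (reader.nextName()) {
                "ngrams" -> {
                    reader.beginObject()
                    while (reader.hasNext()) {
                        // Assumed layout: key = frequency as fraction string,
                        // value = space-separated ngrams sharing that frequency
                        ngrams[reader.nextName()] = reader.nextString()
                    }
                    reader.endObject()
                }
                else -> reader.skipValue()
            }
        }
        reader.endObject()
    }
    return ngrams
}
```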
## In-memory models
- The `Object2DoubleOpenHashMap` load factor can be increased from the default 0.75 to a higher value. This reduces memory usage but might slow down execution.
- Ngrams can be encoded using primitives. Since this project uses only up to fivegrams (5 chars), most of the ngrams (and for some languages even ngrams of all lengths) can be encoded as JVM primitives using bitwise operations, e.g.:
  - Unigrams as `Byte` or `Char`
  - Bigrams as `Short` or `Int`
  - Trigrams as `Int` or `Long`
  - Quadrigrams as `Int` or `Long`
  - Fivegrams as `Long` or, in the worst case, as a `String` object.

  Note that at least for fivegrams the binary encoding should probably be offset based, so one char is the base code point and the remaining bits of the `Long` encode the offsets of the other chars relative to the base char (see the sketch after this list). This allows encoding alphabets such as Georgian whose chars do not fit into the `Long.SIZE_BITS / 5` (= 12) bits available per char. This might even increase execution speed since it avoids `hashCode()` and `equals(...)` calls when looking up frequencies (the speed-up, if any, has to be tested though).
- Reduce the frequency accuracy for in-memory models and model files from 64-bit `Double` to 32-bit. This can have a big impact on memory usage, saving more than 100 MB with all models preloaded. However, instead of using a 32-bit `Float` to store the frequency, a custom 32-bit encoding can (and maybe should) be used, since `Float` 'wastes' some bits for the sign (the frequency will never be negative) and the exponent (the frequency will never be >= 1.0); see the sketch after this list. This might decrease language detection speed due to the decoding overhead though.
- Remove Korean fivegrams (and quadrigrams?). The Korean language models are quite large; additionally, due to the large range of Korean code points, a great majority (> 1,000,000 fivegrams (?)) cannot be encoded with the primitive encoding approach outlined above. Chinese and Japanese don't seem to have quadrigram and fivegram models either. I am not sure whether this is due to how those languages work, but maybe it would be acceptable to drop them for Korean as well, also because detection of Korean seems to be rather unambiguous.
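A minimal sketch of the offset-based fivegram encoding mentioned above; the exact bit layout (16 bits for the base char plus four signed 12-bit offsets, filling the 64 bits of a `Long` exactly) is an assumption for illustration:

```kotlin
// Returns null if an offset does not fit into 12 bits, in which case
// the fivegram has to fall back to a String-based representation.
fun encodeFivegram(fivegram: String): Long? {
    require(fivegram.length == 5)
    val base = fivegram[0].code
    var encoded = base.toLong() // bits 0..15: base char
    for (i in 1..4) {
        val offset = fivegram[i].code - base
        if (offset !in -2048..2047) return null // must fit into signed 12 bits
        // bits 16..63: four 12-bit offsets relative to the base char
        encoded = encoded or ((offset.toLong() and 0xFFF) shl (16 + (i - 1) * 12))
    }
    return encoded
}
```

Decoding would sign-extend each 12-bit offset and add it back to the base char.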
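For the reduced frequency accuracy, one possible custom 32-bit encoding is a plain unsigned fixed-point fraction, sketched below; note that this is just one option and, unlike a `Float` or an encoding with a small exponent, it sacrifices relative precision for very small frequencies:

```kotlin
// All 32 bits encode the fractional part, since no bits are needed
// for a sign (never negative) or for values >= 1.0.
fun encodeFrequency(frequency: Double): Int {
    require(frequency > 0.0 && frequency < 1.0)
    return (frequency * 4294967296.0).toLong().toInt() // frequency * 2^32
}

fun decodeFrequency(encoded: Int): Double =
    (encoded.toLong() and 0xFFFFFFFFL) / 4294967296.0
```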
## Runtime performance
- Remove `Alphabet`. The `Alphabet` class can probably be removed; `Character.UnicodeScript` seems to be an exact substitute and might allow avoiding some indirection, e.g. looking up the `UnicodeScript` for a `Char` only once and then comparing it with the expected ones, instead of having each `Alphabet` look up the `UnicodeScript`.
- Avoid the creation of `Ngram` objects. Similar to the primitive encoding described in "In-memory models" above, the `Ngram` objects created as part of splitting up the text can be avoided as well (with a different encoding). A Kotlin inline class can be used to still get type safety and have some convenience functions (see the sketch after this list). Primitive encoding can only support trigrams reliably without too much overhead / too complicated an encoding, but that is probably fine because since d0f7a7c211abb03885cc89febae9d77fbf640342 at most trigrams are used for longer texts.
- Instead of accessing the `lazy` frequency lookup in every iteration, it might be faster to access it once at the beginning and then use it directly (though this could also be premature optimization).
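A minimal sketch of such an inline (value) class for trigrams; the name and the 16-bits-per-char packing are assumptions for illustration:

```kotlin
// Wraps a trigram packed into a Long (16 bits per char) so that no
// Ngram object has to be allocated, while keeping type safety.
@JvmInline
value class PrimitiveTrigram(val encoded: Long) {
    companion object {
        fun of(a: Char, b: Char, c: Char) = PrimitiveTrigram(
            a.code.toLong()
                or (b.code.toLong() shl 16)
                or (c.code.toLong() shl 32)
        )
    }

    // Convenience accessor for the char at the given index (0..2)
    fun charAt(index: Int): Char =
        ((encoded ushr (index * 16)) and 0xFFFF).toInt().toChar()
}
```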
## Conclusion
With some or all of these suggestions applied, memory usage can be reduced and execution speed can be increased without affecting accuracy. However, some of the suggestions might be premature optimization, and they only work for 16-bit `Char` values but not for supplementary code points (> 16 bits) (though the current implementation, mainly the `Ngram` creation, seems to have that limitation as well).
I have implemented some of these optimizations and some other minor improvements in https://github.com/Marcono1234/lingua/tree/experimental/performance. However, these changes are pretty experimental: the Git history is not very nice to look at; in some commits I fixed bugs I introduced before or reverted changes again. Additionally, the unit tests and the model file writing are broken. Some of the changes might also be premature optimization. Though maybe it is interesting nonetheless: it appears the memory usage with all languages preloaded went down to about 640 MB (Edit: 920 MB, I made a mistake in the binary encoding) on AdoptOpenJDK 11.
---
Hi @Marcono1234, thank you for all the hard work you have done, both now and over the past months. You have been of great help for this project. 😃 May I ask what motivates you to contribute so much to my work? Is it just for fun, for your own learning purposes, or because you need a language detector for your own work?
I will gladly study your changes and certainly backport some of them. But if some of the code turns out to be too complicated to maintain, I will most probably refrain from merging it.
Of all your changes, I suppose the binary format has the greatest effect. Am I right? This is something I would probably adopt. JSON is obviously not the best fit here.
Over the past months I have been tinkering with Lingua performance optimizations, and I think I have now got it to a somewhat maintainable state: https://github.com/Marcono1234/tiny-lingua
Would be great if you could give it a try and let me know what you think!
The changes are pretty extensive and the code is not easily maintainable anymore so I assume you, @pemistahl, won’t be interested in most of these changes. But if you notice any particular change you want to include in Lingua, feel free to let me know (or apply the change yourself). I might also try to backport some of the less extensive changes to Lingua (if you are interested).