# Improve performance and reduce memory consumption
As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua. In this issue I will try to highlight the main areas where performance can be improved; some of this is already covered by #98. Note that some of the proposed changes might decrease execution speed or require some larger refactoring.
## Model files
- Instead of storing the model data in JSON format, a binary format matching the in-memory format could be used (see "In-memory models" section). This would have the advantages that:
  - Lookup maps such as `Char2DoubleOpenHashMap` could be created with the expected size, avoiding rehashing of the maps during deserialization.
  - Model file loading is faster.
  - Model file sizes will be slightly smaller when encoding the frequency only once, followed by the number of ngrams which share this frequency, followed by the ngram values.

  Note that even though the fastutil maps are `Serializable`, using JDK serialization might introduce unnecessary overhead and would make this library dependent on the internal serialization format of the fastutil maps. Instead, the data could be written manually to a `DataOutputStream`; a sketch of such a layout is shown below.
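A minimal sketch of what this could look like, assuming a layout where the total ngram count comes first (so the lookup map can be pre-sized), followed by one block per frequency group; all names here are hypothetical:

```kotlin
import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap
import java.io.DataInputStream
import java.io.DataOutputStream

// Hypothetical layout: total ngram count, number of frequency groups,
// then per group: frequency (once), group size, ngram values.
fun writeModel(out: DataOutputStream, ngramsByFrequency: Map<Double, List<String>>) {
    out.writeInt(ngramsByFrequency.values.sumOf { it.size }) // total ngram count
    out.writeInt(ngramsByFrequency.size)                     // number of frequency groups
    for ((frequency, ngrams) in ngramsByFrequency) {
        out.writeDouble(frequency) // frequency is written only once per group
        out.writeInt(ngrams.size)  // number of ngrams sharing this frequency
        for (ngram in ngrams) out.writeUTF(ngram)
    }
}

fun readModel(input: DataInputStream): Object2DoubleOpenHashMap<String> {
    val totalNgrams = input.readInt()
    // Pre-sizing with the expected element count avoids rehashing while loading
    val map = Object2DoubleOpenHashMap<String>(totalNgrams)
    repeat(input.readInt()) {
        val frequency = input.readDouble()
        repeat(input.readInt()) { map.put(input.readUTF(), frequency) }
    }
    return map
}
```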
## Model file loading
- Use a streaming JSON library. The currently used `kotlinx-serialization-json` does not seem to support streaming yet. Therefore, currently the complete model files are loaded as a `String` before being parsed. This is (likely) slow and requires large amounts of memory. Instead, a streaming JSON library such as https://github.com/square/moshi should be used; see the sketch below. Note that this point becomes obsolete if a binary format (as described in the "Model files" section above) is used.
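A minimal sketch of such streaming parsing with Moshi's `JsonReader`; the `"ngrams"` object layout assumed here is for illustration only and is not necessarily Lingua's actual JSON structure:

```kotlin
import com.squareup.moshi.JsonReader
import okio.buffer
import okio.source
import java.io.InputStream

// Streams the model file token by token instead of materializing
// the whole file as a String first.
fun readNgrams(inputStream: InputStream): Map<String, String> {
    val ngrams = mutableMapOf<String, String>()
    JsonReader.of(inputStream.source().buffer()).use { reader ->
        reader.beginObject()
        while (reader.hasNext()) {
            when (reader.nextName()) {
                "ngrams" -> {
                    reader.beginObject()
                    while (reader.hasNext()) {
                        // Assumed layout: key = frequency as fraction string,
                        // value = space-separated ngrams sharing that frequency
                        ngrams[reader.nextName()] = reader.nextString()
                    }
                    reader.endObject()
                }
                else -> reader.skipValue()
            }
        }
        reader.endObject()
    }
    return ngrams
}
```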
## In-memory models
- The `Object2DoubleOpenHashMap` load factor can be increased from the default 0.75 to a higher value. This reduces memory usage but might slow down execution.
- Ngrams can be encoded using primitives. Since this project uses only up to fivegrams (5 chars), most of the ngrams (and for some languages even ngrams of all lengths) can be encoded as JVM primitives using bitwise operations, e.g.:
  - Unigrams as `Byte` or `Char`
  - Bigrams as `Short` or `Int`
  - Trigrams as `Int` or `Long`
  - Quadrigrams as `Int` or `Long`
  - Fivegrams as `Long` or, in the worst case, as a `String` object.

  Note that at least for fivegrams the binary encoding should probably be offset based, so one char is the base code point and the remaining bits of the `Long` encode the offsets of the other chars relative to the base char (see the sketch after this list). This allows encoding alphabets such as Georgian whose chars do not fit into the `Long.SIZE_BITS / 5` (= 12) bits available per char. This might even increase execution speed since it avoids `hashCode()` and `equals(...)` calls when looking up frequencies (the speed-up, if any, has to be tested though).
- Reduce the frequency accuracy for in-memory models and model files from 64-bit `Double` to 32-bit. This can have a big impact on memory usage, saving more than 100 MB with all models preloaded. However, instead of using a 32-bit `Float` to store the frequency, a custom 32-bit encoding can (and maybe should) be used, since `Float` 'wastes' some bits for the sign (the frequency will never be negative) and the exponent (the frequency will never be >= 1.0); see the sketch after this list. This might decrease language detection speed due to the decoding overhead though.
- Remove Korean fivegrams (and quadrigrams?). The Korean language models are quite large; additionally, due to the large range of Korean code points, a great majority (> 1,000,000 fivegrams (?)) cannot be encoded with the primitive encoding approach outlined above. Chinese and Japanese don't seem to have quadrigram and fivegram models either. I am not sure whether this is due to how those languages work, but maybe it would be acceptable to drop them for Korean as well, also because detection of Korean seems to be rather unambiguous.
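A minimal sketch of the offset-based fivegram encoding mentioned above; the exact bit layout (16 bits for the base char plus four signed 12-bit offsets, filling the 64 bits of a `Long` exactly) is an assumption for illustration:

```kotlin
// Returns null if an offset does not fit into 12 bits, in which case
// the fivegram has to fall back to a String-based representation.
fun encodeFivegram(fivegram: String): Long? {
    require(fivegram.length == 5)
    val base = fivegram[0].code
    var encoded = base.toLong() // bits 0..15: base char
    for (i in 1..4) {
        val offset = fivegram[i].code - base
        if (offset !in -2048..2047) return null // must fit into signed 12 bits
        // bits 16..63: four 12-bit offsets relative to the base char
        encoded = encoded or ((offset.toLong() and 0xFFF) shl (16 + (i - 1) * 12))
    }
    return encoded
}
```

Decoding would sign-extend each 12-bit offset and add it back to the base char.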
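For the reduced frequency accuracy, one possible custom 32-bit encoding is a plain unsigned fixed-point fraction, sketched below; note that this is just one option and, unlike a `Float` or an encoding with a small exponent, it sacrifices relative precision for very small frequencies:

```kotlin
// All 32 bits encode the fractional part, since no bits are needed
// for a sign (never negative) or for values >= 1.0.
fun encodeFrequency(frequency: Double): Int {
    require(frequency > 0.0 && frequency < 1.0)
    return (frequency * 4294967296.0).toLong().toInt() // frequency * 2^32
}

fun decodeFrequency(encoded: Int): Double =
    (encoded.toLong() and 0xFFFFFFFFL) / 4294967296.0
```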
## Runtime performance
- Remove `Alphabet`. The `Alphabet` class can probably be removed; `Character.UnicodeScript` seems to be an exact substitute and might allow avoiding some indirection, e.g. looking up the `UnicodeScript` for a `Char` only once and then comparing it with the expected ones, instead of having each `Alphabet` look up the `UnicodeScript`.
- Avoid the creation of `Ngram` objects. Similar to the primitive encoding described in "In-memory models" above, the `Ngram` objects created as part of splitting up the text can be avoided as well (with a different encoding). A Kotlin inline class can be used to still get type safety and have some convenience functions (see the sketch after this list). Primitive encoding can only support trigrams reliably without too much overhead / too complicated an encoding, but that is probably fine because since d0f7a7c211abb03885cc89febae9d77fbf640342 at most trigrams are used for longer texts.
- Instead of accessing the `lazy` frequency lookup in every iteration, it might be faster to access it once at the beginning and then use it directly (though this could also be premature optimization).
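A minimal sketch of such an inline (value) class for trigrams; the name and the 16-bits-per-char packing are assumptions for illustration:

```kotlin
// Wraps a trigram packed into a Long (16 bits per char) so that no
// Ngram object has to be allocated, while keeping type safety.
@JvmInline
value class PrimitiveTrigram(val encoded: Long) {
    companion object {
        fun of(a: Char, b: Char, c: Char) = PrimitiveTrigram(
            a.code.toLong()
                or (b.code.toLong() shl 16)
                or (c.code.toLong() shl 32)
        )
    }

    // Convenience accessor for the char at the given index (0..2)
    fun charAt(index: Int): Char =
        ((encoded ushr (index * 16)) and 0xFFFF).toInt().toChar()
}
```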
## Conclusion
With some or all of these suggestions applied, memory usage can be reduced and execution speed can be increased without affecting accuracy. However, some of the suggestions might be premature optimization, and they only work for 16-bit `Char` values but not for supplementary code points (> 16 bits) (though the current implementation, mainly the `Ngram` creation, seems to have that limitation as well).
I have implemented some of these optimizations and some other minor improvements in https://github.com/Marcono1234/lingua/tree/experimental/performance. However, these changes are pretty experimental: the Git history is not very nice to look at; in some commits I fixed bugs I introduced before or reverted changes again. Additionally, the unit tests and the model file writing are broken. Some of the changes might also be premature optimization. Though maybe it is interesting nonetheless: it appears the memory usage with all languages preloaded went down to about 640 MB (Edit: 920 MB, I made a mistake in the binary encoding) on AdoptOpenJDK 11.
---
Hi @Marcono1234, thank you for all the hard work you have done, both now and over the past months. You have been of great help for this project. 😃 May I ask what motivates you to contribute so much to my work? Is it just for fun, for your own learning purposes, or because you need a language detector for your own work?
I will gladly study your changes and certainly backport some of them. But if some of the code turns out to be too complicated to maintain, I will most probably refrain from merging it.
Of all your changes, I suppose the binary format has the greatest effect. Am I right? This is something I would probably adopt. JSON is obviously not the best fit here.
Over the past months I have been tinkering with Lingua performance optimizations, and I think I have now got it to a somewhat maintainable state: https://github.com/Marcono1234/tiny-lingua
Would be great if you could give it a try and let me know what you think!
The changes are pretty extensive and the code is not easily maintainable anymore so I assume you, @pemistahl, won’t be interested in most of these changes. But if you notice any particular change you want to include in Lingua, feel free to let me know (or apply the change yourself). I might also try to backport some of the less extensive changes to Lingua (if you are interested).