Selective node splitting
So @gloryknight came up with a very interesting and simple heuristic that apparently works really well on many JSON files: whenever a `,` is encountered (the character is configurable, e.g. a space or newline), LZ is forced to stop the current substring and start anew (with a leading comma), except when the current substring already starts with a comma. See these two pull requests:
https://github.com/JobLeonard/lz-string/pull/3
https://github.com/JobLeonard/lz-string/pull/4
The reason this is effective is a bit subtle: imagine that we are scanning through a string, and the next characters are `abcdefg`. Furthermore, our dictionary already has the substrings `abc`, `abcd` and `defg` (plus the necessary substrings to get to this point), but not `efg`. Obviously, the ideal combination of tokens would be `abc` + `defg`. Instead we’ll get `abcd` + `e` + `f` + `g`, because the matcher greedily takes the longest known prefix. This can happen quite often in LZ. So how to avoid it? Well, I guess gloryknight’s insight was that not all characters are created equal here; some have special functions. One of those functions is as a separator character. Think of natural language: our words are separated by spaces, so if we split on the space character (and similar separators like newlines, dots and commas) we would converge on identical substrings much more quickly.
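To make the heuristic concrete, here is a minimal LZ78-style tokenizer sketch in JavaScript. This is hypothetical illustration code, not lz-string’s actual implementation (real LZ emits dictionary indices, not phrase strings): when a separator is set, the current phrase is force-closed at that character, unless the phrase itself already starts with it.

```javascript
// Hypothetical sketch, NOT lz-string's real code. Phrases are emitted as
// strings (rather than dictionary indices) so the effect of the separator
// heuristic is easy to see.
function tokenize(input, sep) {
  const dict = new Set();
  const tokens = [];
  let phrase = "";
  for (const ch of input) {
    // gloryknight's heuristic: force-close the phrase at a separator,
    // unless the phrase already starts with that separator.
    if (ch === sep && phrase !== "" && phrase[0] !== sep) {
      tokens.push(phrase);      // phrase is already in the dictionary
      phrase = ch;              // start anew with a leading separator
    } else if (dict.has(phrase + ch)) {
      phrase += ch;             // keep extending a known phrase
    } else {
      tokens.push(phrase + ch); // emit the new phrase and memorize it
      dict.add(phrase + ch);
      phrase = "";
    }
  }
  if (phrase !== "") tokens.push(phrase);
  return tokens;
}

const input = "cat,dog,cat,dog,cat,dog";
const plain = tokenize(input);      // plain greedy LZ78-style phrases
const split = tokenize(input, ","); // with the forced comma split
```

Both variants are lossless (the tokens concatenate back to the input). With the separator set, a comma can only ever appear at the start of a token, so repeated comma-delimited fields converge onto the same dictionary entries much sooner.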
Since LZString is most commonly used when compressing JSON, which strips out all unnecessary whitespace, the `,` is the option that seems to improve compression performance (although maybe `{` and `:` also make sense, or maybe all three?). In his tests it gave significant compression benefits at a small performance cost.
The best bit? This is perfectly backwards compatible with previous codes: the output can be decompressed by the same function as before.
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 3
- Comments: 15 (2 by maintainers)
Top GitHub Comments
I can see that being far better for the data, though I’d think that the double-quote character may give slightly better compression: there’s likely going to be at least one token of `": "`, and all keys might get found, not just the second one onwards. I really don’t have time to look at it properly, but it “feels” right as a technique regardless of the character(s) used.

Saying that, I think it should be behind a flag. It might be non-breaking, but it’s also very much meant for JSON data, and might make non-JSON a little bit larger (though a quote character may be less subject to that, etc.).
You are right. In this particular case pure separator splitting will produce worse results, but we are using a trick and do not split entries starting with a separator. This improves the result (due to faster convergence). A more complex case could be:

`AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA`

This does lead to many extra `,AAA` entries in the dictionary. The last post eliminates these repetitions. Again, it all depends on input data. Sometimes the gain is negative 😃
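As a sketch of what happens on that input, here is a minimal hypothetical LZ78-style tokenizer (again, illustration only, not lz-string’s real code) using the same do-not-split-entries-starting-with-a-separator trick, returning its dictionary so the comma-prefixed stair-step of entries can be inspected:

```javascript
// Hypothetical sketch (not lz-string's actual code): LZ78-style phrases,
// force-closed at `sep` unless the phrase starts with `sep`.
// Returns the dictionary so we can inspect what got memorized.
function tokenizeWithDict(input, sep) {
  const dict = new Set();
  const tokens = [];
  let phrase = "";
  for (const ch of input) {
    if (ch === sep && phrase !== "" && phrase[0] !== sep) {
      tokens.push(phrase); // force-close at the separator
      phrase = ch;
    } else if (dict.has(phrase + ch)) {
      phrase += ch;
    } else {
      tokens.push(phrase + ch);
      dict.add(phrase + ch);
      phrase = "";
    }
  }
  if (phrase !== "") tokens.push(phrase);
  return { tokens, dict };
}

const example = "AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA";
const { tokens, dict } = tokenizeWithDict(example, ",");
// The dictionary builds up the comma-prefixed stair-step
// ",A", ",AA", ",AAA", ... before ",AAA" finally becomes reusable.
```

This is exactly the repetition described above: each `,AAA`-style prefix has to be learned one character at a time before it pays off, so the gain depends heavily on how often the delimited field actually repeats.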