Selective node splitting
So @gloryknight came up with a very interesting and simple heuristic that apparently works really well on many JSON files: whenever a `,` is encountered (the character is configurable, e.g. a space or newline), LZ is forced to stop the current substring and start anew (with a leading comma), except when the current substring already starts with a comma. See these two pull requests:
https://github.com/JobLeonard/lz-string/pull/3
https://github.com/JobLeonard/lz-string/pull/4
The reason this is effective is a bit subtle: imagine that we are scanning through a string, and the next characters are `abcdefg`. Furthermore, our dictionary already has the substrings `abc`, `abcd` and `defg` (plus the necessary substrings to get to this point), but not `efg`. Obviously, the ideal combination of tokens would be `abc` + `defg`. Instead we’ll get `abcd` + `e` + `f` + `g`, because the matcher greedily takes the longest known prefix. This can happen quite often in LZ. So how to avoid it? Well, I guess gloryknight’s insight was that not all characters are created equal here; some have special functions. One of those functions is as a separator character. Think of natural language: our words are separated by spaces, so if we split on the space character (and similar separators like newlines, dots and commas) we would converge on identical substrings much more quickly.
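To make the heuristic concrete, here is a minimal LZ78-style tokenizer sketch in JavaScript. This is hypothetical illustration code, not lz-string’s actual implementation (real LZ emits dictionary indices, not phrase strings): when a separator is set, the current phrase is force-closed at that character, unless the phrase itself already starts with it.

```javascript
// Hypothetical sketch, NOT lz-string's real code. Phrases are emitted as
// strings (rather than dictionary indices) so the effect of the separator
// heuristic is easy to see.
function tokenize(input, sep) {
  const dict = new Set();
  const tokens = [];
  let phrase = "";
  for (const ch of input) {
    // gloryknight's heuristic: force-close the phrase at a separator,
    // unless the phrase already starts with that separator.
    if (ch === sep && phrase !== "" && phrase[0] !== sep) {
      tokens.push(phrase);      // phrase is already in the dictionary
      phrase = ch;              // start anew with a leading separator
    } else if (dict.has(phrase + ch)) {
      phrase += ch;             // keep extending a known phrase
    } else {
      tokens.push(phrase + ch); // emit the new phrase and memorize it
      dict.add(phrase + ch);
      phrase = "";
    }
  }
  if (phrase !== "") tokens.push(phrase);
  return tokens;
}

const input = "cat,dog,cat,dog,cat,dog";
const plain = tokenize(input);      // plain greedy LZ78-style phrases
const split = tokenize(input, ","); // with the forced comma split
```

Both variants are lossless (the tokens concatenate back to the input). With the separator set, a comma can only ever appear at the start of a token, so repeated comma-delimited fields converge onto the same dictionary entries much sooner.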
Since LZString is most commonly used when compressing JSON, which strips out all unnecessary whitespace, the `,` is the option that seems to improve compression performance (although maybe `{` and `:` also make sense, or maybe all three?). In his tests it gave significant compression benefits at a small performance cost.
The best bit? This is perfectly backwards compatible with previous codes: the output can be decompressed by the same function as before.
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 3
- Comments: 15 (2 by maintainers)
Top GitHub Comments
I can see that being far better for the data, though I’d think that the double-quote character may give slightly better compression: there’s likely going to be at least one token of `": "`, and all keys might get found, not just the second one onwards. I really don’t have time to look at it properly, but it “feels” right as a technique regardless of the character(s) used.

Saying that, I think it should be behind a flag. It might be non-breaking, but it’s also very much meant for JSON data, and might make non-JSON a little bit larger (though a quote character may be less subject to that, etc.).
You are right. In this particular case pure separator splitting will produce worse results, but we are using a trick and do not split entries starting with a separator. This improves the result (due to faster convergence). A more complex case could be:

`AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA`

This does lead to many extra `,AAA` entries in the dictionary. The last post eliminates these repetitions. Again, it all depends on input data. Sometimes the gain is negative 😃
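As a sketch of what happens on that input, here is a minimal hypothetical LZ78-style tokenizer (again, illustration only, not lz-string’s real code) using the same do-not-split-entries-starting-with-a-separator trick, returning its dictionary so the comma-prefixed stair-step of entries can be inspected:

```javascript
// Hypothetical sketch (not lz-string's actual code): LZ78-style phrases,
// force-closed at `sep` unless the phrase starts with `sep`.
// Returns the dictionary so we can inspect what got memorized.
function tokenizeWithDict(input, sep) {
  const dict = new Set();
  const tokens = [];
  let phrase = "";
  for (const ch of input) {
    if (ch === sep && phrase !== "" && phrase[0] !== sep) {
      tokens.push(phrase); // force-close at the separator
      phrase = ch;
    } else if (dict.has(phrase + ch)) {
      phrase += ch;
    } else {
      tokens.push(phrase + ch);
      dict.add(phrase + ch);
      phrase = "";
    }
  }
  if (phrase !== "") tokens.push(phrase);
  return { tokens, dict };
}

const example = "AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA";
const { tokens, dict } = tokenizeWithDict(example, ",");
// The dictionary builds up the comma-prefixed stair-step
// ",A", ",AA", ",AAA", ... before ",AAA" finally becomes reusable.
```

This is exactly the repetition described above: each `,AAA`-style prefix has to be learned one character at a time before it pays off, so the gain depends heavily on how often the delimited field actually repeats.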