
Selective node splitting


So @gloryknight came up with a very interesting and simple heuristic that apparently works really well in many JSON files: whenever a , is encountered (the separator can be set to another character, like a space or a newline), LZ is forced to stop the current substring and start a new one with a leading comma, except when the current substring already starts with a comma. See these two pull requests, and the rough sketch after the links:

https://github.com/JobLeonard/lz-string/pull/3

https://github.com/JobLeonard/lz-string/pull/4
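
To make the rule concrete, here is a minimal, toy LZW-style sketch of the idea. This is an assumption-heavy illustration: the tokenize function, its separator parameter and the plain string phrases are made up for readability; the actual pull requests work inside lz-string's bit-packed compressor, not on phrase lists.

```ts
// Toy LZW-style tokenizer with selective splitting (illustrative sketch only,
// not lz-string's real implementation or the exact PR code).
function tokenize(input: string, separator: string | null = ","): string[] {
  const dict = new Set<string>();  // phrases learned so far
  const tokens: string[] = [];     // emitted phrases (stand-ins for output codes)
  let phrase = "";

  for (const ch of input) {
    const candidate = phrase + ch;
    // Force a split on the separator, unless the current phrase already
    // starts with it (gloryknight's exception).
    const forceSplit =
      separator !== null &&
      ch === separator &&
      phrase !== "" &&
      !phrase.startsWith(separator);

    if (!forceSplit && (phrase === "" || dict.has(candidate))) {
      phrase = candidate;          // keep extending the current match
    } else {
      tokens.push(phrase);         // emit the longest phrase found so far
      dict.add(candidate);         // learn the extended phrase for later reuse
      phrase = ch;                 // start anew; after a forced split this is the separator
    }
  }
  if (phrase !== "") tokens.push(phrase);
  return tokens;
}
```

The only change relative to a plain LZW-style loop is the forceSplit check; everything else behaves as before, which is presumably why the output stays decompressible by the unmodified decompression function (as noted at the end of the issue).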

The reason this is effective is a bit subtle: imagine we are scanning through a string and the next characters are abcdefg. Furthermore, our dictionary already contains the substrings abc, abcd and defg (plus the necessary substrings to get to this point), but not efg. Obviously, the ideal combination of tokens would be abc + defg. Instead, because matching is greedy, we get abcd + e + f + g. This can happen quite often in LZ. So how do we avoid it? Well, I guess gloryknight's insight was that not all characters are created equal here; some have special functions, one of which is acting as a separator. Think of natural language: our words are separated by spaces, so if we split on the space character (and similar separators like newlines, dots and commas) we would converge on identical substrings much quicker.
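
The mismatch is easy to reproduce with a plain greedy longest-match lookup. The helper below is hypothetical (it uses a fixed dictionary rather than one built incrementally), but it shows exactly the effect described above:

```ts
// Greedy longest-match tokenizer over a fixed phrase dictionary (hypothetical
// helper for illustration; single characters are always considered known).
function greedyTokens(input: string, dict: Set<string>): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < input.length) {
    let match = input[i];
    for (let len = 2; i + len <= input.length; len++) {
      const candidate = input.slice(i, i + len);
      if (dict.has(candidate)) match = candidate;
      else break;                 // LZ-style: stop once the extension is unknown
    }
    out.push(match);
    i += match.length;
  }
  return out;
}

const withAbcd = new Set(["ab", "abc", "abcd", "de", "def", "defg"]);
console.log(greedyTokens("abcdefg", withAbcd));    // [ "abcd", "e", "f", "g" ]

const withoutAbcd = new Set(["ab", "abc", "de", "def", "defg"]);
console.log(greedyTokens("abcdefg", withoutAbcd)); // [ "abc", "defg" ]
```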

Since LZString is most commonly used when compressing JSON, which typically has all unnecessary whitespace stripped out, the , is the separator that seems to improve compression the most (although maybe { and : also make sense, or maybe all three?). In his tests it gave significant compression benefits at a small performance cost.
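
A quick way to experiment with the separator choice is to run the toy tokenize sketch from above over a JSON sample and compare how many phrases it emits. Token count is only a crude proxy for the real bit-packed output size, so treat this as a way to play with the idea, not as a benchmark:

```ts
// Hypothetical experiment harness on top of the toy tokenize() sketch.
const sample = JSON.stringify(
  Array.from({ length: 50 }, (_, i) => ({ name: "item" + i, value: i, tags: ["a", "b"] }))
);

for (const sep of [",", ":", "{", null]) {
  console.log(JSON.stringify(sep), tokenize(sample, sep).length);
}
```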

The best bit? This is perfectly backwards compatible: the output can still be decompressed by the same, unmodified decompression function as before.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 3
  • Comments: 15 (2 by maintainers)

Top GitHub Comments

2 reactions
Rycochet commented, Sep 20, 2018

I can see that being far better for the data, though I’d think that the double quote character may give slightly better compression: there’s likely going to be at least one ": " token, but also all keys might get found, and not just the second one onwards. I’ve really not got time to look at it properly, but it “feels” right as a technique regardless of the character(s) used.

Saying that, I think it should be behind a flag: it might be non-breaking, but it’s also very much meant for JSON data, and might make non-JSON input a little bit larger (though a quote character may be less subject to that, etc.).
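
For what it’s worth, in the toy sketch from the issue body the separator is already a parameter, so both suggestions here (a different split character, or opting out entirely) are easy to express. This is purely illustrative and not the actual lz-string or PR API:

```ts
// Hypothetical usage of the toy tokenize() sketch above.
const json = '{"name":"a","value":1}';
tokenize(json, '"');   // split on double quotes, as suggested here
tokenize(json, null);  // opt out: plain LZW-style matching, no forced splits
```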

1 reaction
gloryknight commented, Dec 26, 2018

You are right. In this particular case pure separator splitting would produce worse results, but we use a trick and do not split entries that start with a separator. This improves the result (due to faster convergence). A more complex case could be:

AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA

This does lead to many extra ,AAA-style entries in the dictionary. The last post eliminates these repetitions. Again, it all depends on the input data. Sometimes the gain is negative 😃
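
Running the toy tokenize sketch from the issue body on that string shows the effect. Note the sketch only models the basic splitting rule, not the later refinement that eliminates the repetitions, and whether the real code behaves exactly like this is an assumption:

```ts
// With the hypothetical tokenize() sketch, the emitted phrases include the
// whole chain ",A", then ",AA", and only then ",AAA": each step is learned
// separately before ",AAA" can finally be reused as a single phrase.
console.log(tokenize("AAA,1,AAA,2,AAA,3,AAA,4,AAA,5,AAA,6,AAA"));
```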

