Discuss: File size of distributable for the browser

See original GitHub issue

I wanted to open a discussion about the direction we’re headed with regard to file size. It’s come up recently in talking to StackOverflow, but otherwise I don’t hear much about it.

First some history

In terms of absolute size we’ve grown quite a bit in the past year or so. All numbers are stated in terms of gzipped size.

  • When I came on board in Sept 2019: ~20kb (:common subset, gzipped)
  • Today: ~37kb (:common, gzipped)

We’ve close to doubled the size of our common distributable. Yet in that time we also added several new languages to :common:

  • Go
  • Kotlin
  • Less and SCSS
  • Lua
  • Rust
  • Swift
  • TypeScript

33% of our size increase comes from just these new languages (i.e. ~20kb to 28kb). The rest comes from numerous grammar improvements, parser improvements, etc…

Here and Now / My Thoughts

We have our new “higher fidelity” initiative in #2500. Both 37kb and 20kb seem tiny to me. Yes, it’s possible to produce much larger builds. The full library (with every grammar) weighs in at a whopping 272kb.

All the feedback I see here on issues is of the “please, better highlighting, highlight more, highlight better” variety… I can’t remember anyone pushing back with “the library is too large, make it smaller”. I wanted to open the topic to see if anyone has any thoughts on this.

Personally I feel that our situation now is good and that increasing the bundle even 30-40% would be a win if we end up with much more nuanced highlighting as a result. (I don’t think the size will actually increase that much though.) I don’t see how we can keep the size the same as we pursue higher fidelity and more nuanced highlighting. Many of the recent “language reboots” (LaTeX, Mathematica, etc.) have seen huge improvements in those grammars - but also a significant increase in grammar size.

I still think a very “popular” use for Highlight.js is on a small website/blog where one is using a subset of the languages, not a full build (or anything even close). Then on the other end you have huge sites like Discourse and StackOverflow building larger bundles. In those cases I think the right solution (if size becomes a problem) is to lazy-load the grammars on demand. That has always been easy to do, and it just got easier with my PR eliminating all run-time dependencies between languages.
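To make that concrete, here is a minimal sketch of on-demand grammar loading in the browser. It assumes the per-language browser bundles self-register against the global hljs when their script loads, and the script path is only a placeholder, not an official layout; hljs.getLanguage and hljs.highlightBlock are the existing APIs used here.

const loadedGrammars = new Set();

// Inject a per-language script only the first time that language is seen.
function loadGrammar(lang) {
  if (loadedGrammars.has(lang) || hljs.getLanguage(lang)) return Promise.resolve();
  return new Promise((resolve, reject) => {
    const script = document.createElement('script');
    script.src = `/path/to/languages/${lang}.min.js`; // placeholder path
    script.onload = () => { loadedGrammars.add(lang); resolve(); };
    script.onerror = reject;
    document.head.appendChild(script);
  });
}

// Highlight each block only after its grammar has been fetched.
document.querySelectorAll('pre code[class*="language-"]').forEach((block) => {
  const lang = block.className.match(/language-(\w+)/)[1];
  loadGrammar(lang).then(() => hljs.highlightBlock(block));
});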

A good portion of our size is keyword bundles… There has been talk recently of whether (in some languages) we could detect CamelCaseClassThingy rather than maintain a hard-coded list… and while we could do that (removing some keywords), it would have detrimental effects on our auto-detection capabilities, which for many languages are highly dependent on large keyword lists.
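As a purely illustrative sketch (this is not an existing grammar rule), a convention-based class detector would be a single grammar mode with a regex in place of kilobytes of keywords:

// Hypothetical highlight.js grammar mode (v10-style keys): match CamelCase
// identifiers by convention rather than enumerating class names in a keyword list.
const CAMEL_CASE_CLASS = {
  className: 'title',
  begin: /\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b/, // e.g. CamelCaseClassThingy
};
// The trade-off: the regex costs only a few bytes, but the large keyword lists it
// would replace also feed relevance scoring, which is what auto-detection leans on.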

Also, there are other highlighters. It’s always been my advice that if “small size” is a key requirement, Prism might be a better choice, as they tend to rely a lot more on tight regexes, simpler grammars, and dependency stacking… which helps keep the size of each grammar smaller.

So I see our library continuing to grow slowly in size with every new release… and continuing to highlight with more nuance… with of course continued improvements to the parser and auto-detect when possible.

  • Does anyone think this is the wrong direction?
  • Should we have some sort of size cap on 1st party languages?
  • Any other thoughts?

Note: currently every language is built as a stand-alone module, which hurts our non-compressed size, since some dependency modules end up duplicated in the source. This should have less bearing on the final gzip size though, and there are plans to fix this in the future (when using the official build system to build a monolithic distributable).

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
joshgoebel commented, Feb 18, 2021

My feeling is that web pages that include snippets of all 190 languages must be pretty rare, and that most pages have something like 10 different languages tops – but I have no data to back this up. Anyway, if your website is likely to contain all the languages, do use the bundled version; I agree lazy loading is not for you.

I wasn’t referring entirely to actual real-life webpages or use cases - but rather implementor behavior, including the possibility of confusion and mistakes.

  • There are over 100 million downloads of our default (common) build today (via a single CDN), which is 39 languages and 41kb gzipped.
  • This will be getting smaller with v11 as we rip out some languages from common.
  • We prevent this 190 language “mistake” now by simply not shipping an “all_languages.js”
  • If someone really wants to build/package this manually (from source), then they can.
  • I have no idea how many of those 39 are “used at a time”, but 41kb to me is not “heavy” at all IMHO.
  • StackOverflow adds quite a few additional languages to this and still comes in at around only 80kb I think, which is not onerous (to most implementors)
  • And of course the penalty for loading all 190 with Node.js is much smaller since the code is all local and bandwidth is no issue.

So it seems the actual need here is for a tiny fraction of use cases:

  • Those that need LOTS of languages (where download size really starts to matter)
  • PLUS they aren’t using auto-detect (so they could truly take advantage of lazy loading)

Given all that, I feel right now lazy-loading is best handled outside of core.

This issue was originally created following some discussions with Stack Overflow, who are extremely size sensitive. It took 3 months before anyone else chimed in on the topic. I just don’t think many people actually need this functionality (or care about size super strongly) - and even if I’m mistaken, I don’t (at this time) see huge advantages to it being in core vs a plug-in/add-on.

It seems (esp. after we release an ESM npm package) one could very easily write a small “wrapper” package such as highlightjs-async that provided a custom index (with addl. metadata and async registration calls) and then replace/wrap key API functions with async versions:

  • highlight
  • highlightBlock
  • registerLanguage

I’d suggest this is even quite possible today without much effort using fetch instead of modules.
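As a rough sketch of that wrapper idea: highlightjs-async is the hypothetical package named above, and the grammarUrls map plus the function names here are made up for illustration. It builds only on the real core import path, registerLanguage, getLanguage, and the v10-style highlight(languageName, code) signature used elsewhere in this thread.

// highlightjs-async (hypothetical): load a grammar module on first use, then highlight.
import hljs from 'highlight.js/lib/core';

const grammarUrls = {
  // made-up paths; a real index would ship this metadata for every language
  go: '/grammars/go.js',
  rust: '/grammars/rust.js',
};

async function registerLanguageAsync(name) {
  if (hljs.getLanguage(name)) return; // already registered
  const mod = await import(grammarUrls[name]); // dynamic import does the lazy fetch
  hljs.registerLanguage(name, mod.default);
}

export async function highlightAsync(languageName, code) {
  await registerLanguageAsync(languageName);
  return hljs.highlight(languageName, code);
}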

1 reaction
aduh95 commented, Feb 17, 2021

I have a take on this; it might be ignorant of how people are actually using highlight.js, but here it is: recommend using ESM and stop caring about the bundled size. If we use a promise-based approach, we could use dynamic import (https://caniuse.com/es6-module-dynamic-import) to lazy load only the language parsers the user is actually using, and have a pretty clean API:

<link rel="stylesheet" href="/path/to/styles/default.css">
<!-- IE compatibility -->
<script src="/path/to/promise-polyfill.min.js" nomodule ></script>
<script src="/path/to/highlight.min.js" nomodule ></script>
<script nomodule >hljs.highlightAll().catch(console.error);</script>
<!-- Evergreen browsers -->
<script type="module">
import hljs from '/path/to/highlight.mjs';
hljs.highlightAll().catch(console.error);
</script>

Using this approach, only IE users would suffer from the bundle size.

We would have to make some changes in the underlying API to deal with Promise and import():

import hljs from "./highlight.js/lib/core.js";

// registerLanguage should accept lazy loaded `Language`:
hljs.registerLanguage('c', () => import('./highlight.js/lib/languages/c.js'));
hljs.registerLanguage('d', () => import('./highlight.js/lib/languages/d.js'));

// the actual fetching of the language module can be triggered later by a user
// event – or the modules are never fetched if they are not actually needed.
someForm.addEventListener('submit', (ev) => {
  ev.preventDefault();
  // highlight now returns a Promise
  hljs.highlight(someForm.language.value, someForm.code.value)
    // the correct module is loaded only when we need it
    .then(result => {
      // update the highlighted HTML…
    })
    .catch(err => {
      // the Promise rejects if someForm.language.value doesn't correspond to a
      // registered language or an alias.
      alert('Unknown language?');
      console.error(err);
    });
});

I might be wrong, but I am under the impression that most highlight.js users use a subset of the languages it provides (and sometimes only one). If we are moving to ESM in v11, I think we should do the extra work of having a Promise-based API; users would benefit from it.
