
Tokenization overhaul

See original GitHub issue

The current tokenisation story of VS Code is based on TM grammars, which are pretty powerful, but we are running into their limits if we want to do something more than a top-down scanner can do. Also, once you implement a TM interpreter, you realise how inefficiently regular expressions must be evaluated and how TM grammars were never meant to do much more than simple colouring using just a few rules… The fact that we now have complex grammars that end up producing beautiful tokens is more of a testament to the amazing computing power available to us than to the design of the TM grammar semantics.

At the time when TM grammars were introduced and became popular, there were no language servers available which understood the semantics of a language. Therefore, TM grammars were also used to colour semantic constructs. The introduction of LSP has brought us language servers for many languages, and we want to leverage this power to reduce the complexity of the tokenizer/classifier. There is already an effort under way to specify how such an API might look under LSP at https://github.com/microsoft/vscode-languageserver-node/pull/367

In any case, for smarter languages where we offer great extensions, such as for TypeScript or C++, we have noticed two different patterns emerge to try and compensate for these limitations.


Complex TM grammar approach (TypeScript)

This approach was taken by TypeScript, where we now have immense regular expressions, which are a testament to the smartness of the author, but which are potentially very slow to evaluate on the UI thread.


Text Editor Decorations (C++)

This approach was taken by C++, where we now receive potentially unbounded amounts of text editor decorations, used to represent semantic tokens, which are pushed by the C++ extension to correct or enhance the TM grammar. The limits of text editor decorations are starting to show; I have collected some of the issues under this query. Due to their memory cost, complexity, and breadth of usage (i.e. we cannot touch the editing logic around them at this point), text editor decorations are not the right tool for this job…


Both approaches show that there is a real need for something more, and that folks who care can get really creative and smart in tackling this need even where we fall short as a platform. This issue is about overhauling how tokenization works in VS Code and tries to address multiple goals at once:

1. Move tokenization off the UI thread

Today, TM tokenization runs on the UI thread. Even more interesting, we have numerous features (such as typing a } or typing (, ', ", etc.) where we need to know synchronously, at the time we interpret the typed character, whether we are in a comment, in a string, in a regex, or somewhere else… So we have code paths where we end up tokenizing the current line synchronously (given the line above is tokenized) in order to find out the exact context we are in, and then we make a decision based on that.
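
To make the synchronous dependency concrete, here is a minimal sketch of the kind of decision described above — whether a typed character should be auto-closed depends on the token context at the cursor. The names (`TokenContext`, `onType`) are illustrative, not actual VS Code internals:

```typescript
// Hypothetical sketch: deciding what a typed character does based on
// the token context at the cursor. Not the real VS Code implementation.

type TokenContext = "code" | "string" | "comment" | "regex";

/** Decide what inserting `ch` should produce, given the current context. */
function onType(ch: string, context: TokenContext): string {
  // Inside strings, comments, or regexes we must not auto-close:
  // typing ' in a comment should insert just '.
  if (context !== "code") {
    return ch;
  }
  const pairs: Record<string, string> = { "'": "'", '"': '"', "(": ")", "{": "}" };
  const close = pairs[ch];
  // In code, auto-close: typing ' inserts '' with the cursor between.
  return close !== undefined ? ch + close : ch;
}

console.log(onType("'", "code"));    // "''"
console.log(onType("'", "comment")); // "'"
```

The context must be known at the moment the keystroke is handled — which is exactly why the classification cannot simply be moved to an asynchronous worker.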

We have looked into this and built a prototype where we removed the synchronous tokenization… Moving this kind of classification off the UI thread entirely would result in severe flakiness… In other words, sometimes pressing ' would insert '|' and sometimes only '|, in the same file, in the same location, based purely on typing speed and the time it takes to send tokens over from the web worker. Having an editor where typing something does one thing 90% of the time and another thing 10% of the time based on typing speed would IMHO be completely unacceptable.

As a one-off approach, we have written a fast classifier for comments, strings and regexes for TS, in TS. We will experiment to see if this classifier could be used synchronously on the UI thread to determine what to do when typing these characters (}, ', etc.). The challenge here lies in making it incremental (not starting from the beginning of the file for each keystroke). Also, since these syntax constructs are “rare” relative to the entire body of text, a line-based representation would not be a good one. Another idea is that perhaps we shouldn’t store the locations of strings, comments, etc., but only the save-points between them, given the classifier would be fast enough to compute them again.
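
The save-point idea can be sketched as a tiny line classifier that maps the state at the start of a line to the state at its end; storing only these per-line end states lets retokenization restart from the nearest checkpoint instead of the top of the file. This is a simplified illustration, not the actual classifier mentioned above:

```typescript
// Sketch of per-line "save-points": scan a line given its start state,
// return its end state. Handles only double-quoted strings and C-style
// comments for illustration.

type State = "code" | "string" | "blockComment";

function classifyLine(line: string, start: State): State {
  let state = start;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (state === "code") {
      if (ch === '"') state = "string";
      else if (ch === "/" && line[i + 1] === "*") { state = "blockComment"; i++; }
      else if (ch === "/" && line[i + 1] === "/") break; // line comment: rest of line
    } else if (state === "string") {
      if (ch === "\\") i++;                // skip escaped character
      else if (ch === '"') state = "code";
    } else { // blockComment
      if (ch === "*" && line[i + 1] === "/") { state = "code"; i++; }
    }
  }
  return state;
}

// End-of-line states act as checkpoints: after an edit on line k, only
// lines >= k need re-scanning, and only until a recomputed state matches
// the stored one again.
const lines = ['const a = "x"; /* open', 'still comment */ const b = 1;'];
let state: State = "code";
const checkpoints = lines.map(l => (state = classifyLine(l, state)));
console.log(checkpoints); // ["blockComment", "code"]
```
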

Another idea circulating was to enable writing Monarch classifiers and contributing them from extensions. This would avoid some of the bad design choices of TM, but would still mean evaluating regexes written by extensions on the UI thread. Yet another idea was to have a “fast” TM grammar that only deals with strings, comments and regexes, and another normal one for tokens – again with the same problem of running regexes written by extensions on the UI thread. Another idea was to build some base parser, with components such as C-style comments, C-style strings, etc., which could be exercised by extensions (i.e. some kind of higher-order constructs than regexes). Or maybe we should just hand-write parsers for the top 90% of our languages to detect strings, comments and regexes… We have not yet taken any clear decision, as we still need to experiment in this area to learn more…

2. Accept tokens from the extension host (semantic tokenization)

Moving TM grammars off the UI thread is good for reducing our freezes and crashes, but it still does not address the fundamental limitations of TM. Here we need to add API such that semantic tokens can be pushed by the extension host. These tokens should behave very much like text editor decorations, but have a different implementation where we can represent them with a lot less memory (just 2 or 3 32-bit numbers, like we do with the other tokens). We should also tweak the way they are adjusted around typing to make the most sense for tokens…
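
To illustrate the memory argument, here is a sketch of packing a token into a couple of 32-bit numbers instead of a heap-allocated decoration object. The exact field layout below (20 bits start, 8 bits length, 4 bits type) is an assumption for illustration, not VS Code's actual encoding:

```typescript
// Illustrative packed-token encoding: 2 uint32s per token instead of an
// object per decoration. Field widths are arbitrary choices for the sketch.

interface Token { line: number; start: number; length: number; type: number; }

// Pack each token as: [line, (start << 12) | (length << 4) | type].
// This caps start at 2^20, length at 2^8, and type at 2^4.
function packTokens(tokens: Token[]): Uint32Array {
  const out = new Uint32Array(tokens.length * 2);
  tokens.forEach((t, i) => {
    out[i * 2] = t.line;
    out[i * 2 + 1] = (t.start << 12) | (t.length << 4) | t.type;
  });
  return out;
}

function unpackToken(data: Uint32Array, i: number): Token {
  const packed = data[i * 2 + 1];
  return {
    line: data[i * 2],
    start: packed >>> 12,
    length: (packed >>> 4) & 0xff,
    type: packed & 0xf,
  };
}

const packed = packTokens([{ line: 3, start: 10, length: 7, type: 2 }]);
console.log(unpackToken(packed, 0)); // { line: 3, start: 10, length: 7, type: 2 }
```

A `Uint32Array` of packed tokens costs 8 bytes per token, versus the per-object overhead (and range objects) that each text editor decoration carries.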

They also need to be updatable incrementally and only as needed. There are discussions of using the visible ranges APIs to prioritize the regions which should receive semantic tokens first. We have not yet begun drafting an API, nor a reference implementation.
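
The prioritization idea could be sketched as: request tokens for the visible ranges first, then for the remaining stretches of the document. The range type and the function below are hypothetical:

```typescript
// Sketch: order token requests so visible ranges come first, followed by
// one range per remaining uncovered stretch of the document.

interface Range { startLine: number; endLine: number; }

function prioritize(documentLines: number, visible: Range[]): Range[] {
  const covered = new Set<number>();
  for (const r of visible) {
    for (let l = r.startLine; l <= r.endLine; l++) covered.add(l);
  }
  // Collect the uncovered stretches as ranges.
  const rest: Range[] = [];
  let start = -1;
  for (let l = 0; l <= documentLines; l++) {
    const uncovered = l < documentLines && !covered.has(l);
    if (uncovered) {
      if (start === -1) start = l;
    } else if (start !== -1) {
      rest.push({ startLine: start, endLine: l - 1 });
      start = -1;
    }
  }
  return [...visible, ...rest];
}

// Viewport (lines 40-60) is requested first, then lines 0-39 and 61-99.
console.log(prioritize(100, [{ startLine: 40, endLine: 60 }]));
```
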

3. (low priority) Enable the integration of other tokenization engines

This is just here to remind us to keep in mind that we might want to move away from the inefficient TM grammars completely at one point in the future. There is a lot of love for Tree-Sitter nowadays and we might want to investigate using it, or we might want to roll our own story, since we do actually have a lot of experience in this area…


Tasks

  • write a fast TS classifier of comments, strings and regexes (done here, with tests here)
  • integrate the fast TS classifier and use it for synchronous classification instead of the TM engine
    • figure out how to manage checkpoints and produce classifications incrementally
  • be able to run TM grammars on a web worker in the rich client
  • be able to run TM grammars on a web worker in the web ui
  • move the TS TM grammar to a web worker and send tokens in batches.
  • move the TS TM grammar on a web worker and implement greedy viewport tokenization on it.
  • once we have async TM tokens, it is not that big of a leap to have async semantic tokens, so explore pushing tokens from the extension host:
    • should we write another token store that is not line-based, since these tokens should be more “rare”?
    • how do we manage two token stores, one owned by TM and one owned by the extension host, and how do we merge them to give a consistent picture to the editor view?
    • having TM running in a web worker works because we own both the UI side and the worker side of things, so we know to remember the last N edits until they get confirmed by the web worker, and the web worker knows which lines to retokenize given it trusts the UI side to update tokens around editing in certain ways. How should we spec this? We need to spec how the editor adjusts tokens when editing and how we expect the extension host to push new tokens in the edited areas…
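
The adjustment contract from the last task could be sketched as follows: on an edit, the editor shifts surviving tokens immediately so the view stays consistent, drops tokens inside the edited area, and expects the (asynchronous) provider to re-send tokens for that area. Types and names here are illustrative:

```typescript
// Sketch of token adjustment around editing: shift tokens after the edit,
// invalidate tokens inside it, keep tokens before it untouched.

interface LineToken { line: number; start: number; length: number; }

/** Adjust tokens after an edit that replaces `deletedLines` lines at
 *  `atLine` with `insertedLines` lines. */
function adjustTokens(
  tokens: LineToken[],
  atLine: number,
  deletedLines: number,
  insertedLines: number,
): LineToken[] {
  const delta = insertedLines - deletedLines;
  const result: LineToken[] = [];
  for (const t of tokens) {
    if (t.line < atLine) {
      result.push(t);                              // before the edit: untouched
    } else if (t.line >= atLine + deletedLines) {
      result.push({ ...t, line: t.line + delta }); // after the edit: shifted
    }
    // tokens within [atLine, atLine + deletedLines) are invalidated;
    // the provider is expected to push fresh tokens for this area
  }
  return result;
}

const tokens = [{ line: 1, start: 0, length: 3 }, { line: 5, start: 2, length: 4 }];
// Delete 2 lines at line 2: the token on line 5 moves to line 3.
console.log(adjustTokens(tokens, 2, 2, 0));
```

Specifying this behaviour precisely is what lets the UI side and the (untrusted, asynchronous) extension host agree on where tokens are between pushes.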

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 136
  • Comments: 21 (8 by maintainers)

Top GitHub Comments

19 reactions
fwcd commented, Aug 16, 2019

Tree-Sitter support would be amazing! It does have a fast parser and grammars for many popular languages (including syntax-highlighters that map Tree-Sitter AST nodes to TextMate scopes).

17 reactions
mattacosta commented, Mar 30, 2020

For comparison, I benchmarked a few engines:

Tokenization performance (PHP). First tokenization run only; does not include V8 optimizations from subsequent runs (such as after editing a file).

File: PHPMailer.php. Average of 10 iterations. All times in milliseconds (lower is better).

| | WASM | JS | TreeSitter | TextMate |
| -- | -- | -- | -- | -- |
| Average | 24.95 | 38.73 | 78.24 | 795.75 |
| Min | 23.58 | 36.48 | 75.81 | 789.64 |
| Max | 26.84 | 41.59 | 80.21 | 809.50 |
| Standard Dev | 1.01 | 1.57 | 1.39 | 6.07 |
| Tokens | 25921 | 25921 | 17559 | 30607 |

There are a number of implementation details to consider, but even the slowest non-TM engine is ~10x faster. Instead of creating a fast tokenizer for each language, I think it would be more efficient to create an API for other engines and immediately get the benefit of whatever is available. This doesn’t even factor in that the non-TM implementations all support incremental changes as well.
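
A benchmark of this shape — run a tokenizer over a file for N iterations and report average/min/max — might look like the following sketch. The `tokenize` argument is a placeholder for whichever engine is being measured, not any of the engines from the table:

```typescript
// Sketch of a tokenizer benchmark harness: time N runs, report stats.
// `performance.now()` is the global high-resolution timer (Node 16+).

function benchmark(
  tokenize: (text: string) => unknown,
  text: string,
  iterations = 10,
): { avg: number; min: number; max: number } {
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    tokenize(text);
    times.push(performance.now() - start);
  }
  const avg = times.reduce((a, b) => a + b, 0) / times.length;
  return { avg, min: Math.min(...times), max: Math.max(...times) };
}

// Trivial stand-in "tokenizer": whitespace splitting.
const result = benchmark(text => text.split(/\s+/), "some sample text ".repeat(1000));
console.log(result.min <= result.avg && result.avg <= result.max); // true
```

Note that measuring only the first run (as the table above does) avoids flattering JIT-warmed numbers, whereas averaging warm runs better reflects steady-state editing.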
