Grapheme support
I think grapheme support needs to be discussed before going down that road.
Grapheme clusters are the Unicode way to represent user-perceived characters built from several consecutive Unicode chars in a string. Some writing systems need them to draw the correct glyph - e.g. drawing the chars independently leads to a different glyph output than drawing them at once.
The problem with a terminal environment - the terminal typically has a grid-based layout where a single cell with a fixed size represents a character and/or a possible cursor position. In a monospaced ASCII world this is perfect: every printable char moves the cursor by one cell and can be rendered at that position. It is not that easy as soon as Unicode enters the stage. The first problems are fullwidth chars, used by many Asian languages, and combining chars, used all over the place in many writing systems. xterm.js can handle this atm with a typical wcwidth implementation and the cell-based model, just as well (or as badly) as most other terminal emulators do.
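As a rough illustration of the cell model described above, here is a minimal sketch of printing into one grid row with a wcwidth-style lookup. The names `simpleWcwidth` and `printToRow` and the tiny width ranges are assumptions made up for this example; this is not the xterm.js implementation, and a real wcwidth table covers far more ranges.

```typescript
// Hypothetical, heavily simplified width lookup (assumption: only a few
// ranges are covered; a real wcwidth table is much larger).
function simpleWcwidth(cp: number): number {
  // Combining Diacritical Marks: zero width, they attach to the previous char.
  if (cp >= 0x0300 && cp <= 0x036f) return 0;
  // A small subset of fullwidth ranges: kana, CJK ideographs, fullwidth forms.
  if ((cp >= 0x3040 && cp <= 0x30ff) ||
      (cp >= 0x4e00 && cp <= 0x9fff) ||
      (cp >= 0xff01 && cp <= 0xff60)) return 2;
  return 1;
}

// A cell holds a string (base char plus any combining chars), an empty string
// for an untouched cell, or null for the trailing half of a fullwidth char.
type Cell = string | null;

function printToRow(text: string, cols: number): Cell[] {
  const row: Cell[] = [];
  for (let i = 0; i < cols; i++) row.push("");
  let cursor = 0;
  let lastBase = -1; // index of the cell holding the most recent base char
  for (const ch of Array.from(text)) {
    const width = simpleWcwidth(ch.codePointAt(0) ?? 0);
    if (width === 0) {
      // Combining char: append to the previous base cell, cursor stays put.
      if (lastBase >= 0) row[lastBase] = (row[lastBase] ?? "") + ch;
      continue;
    }
    if (cursor + width > cols) break; // this sketch does not wrap lines
    row[cursor] = ch;
    lastBase = cursor;
    if (width === 2) row[cursor + 1] = null; // placeholder cell
    cursor += width;
  }
  return row;
}
```

With this sketch, a fullwidth char occupies two cells (the second one being `null`), and a combining char never moves the cursor, which is the behaviour the questions in this issue start from.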
Grapheme clusters are kind of the next level of this problem - they typically join multiple cells that wcwidth plus the combining char handling would output, so that several combined cells form ONE perceived character. Compared to the current fullwidth char handling:
[..., fullwidth_char, null, ...] --> 2 cells for fullwidth char
and the current combining handling:
[..., char+combining, ...] --> 1 cell for char with combining
and the current fullwidth + combining handling:
[..., fullwidth_char+combining, null, ...] --> 2 cells for fullwidth char with combining
it is not that easy with grapheme support anymore. Clusters can be built from any combination of the above (limited by the grapheme breaking algorithm, of course), which raises several questions:
- Where to add the chars up? The current combining handling adds modifier characters to the first cell (char+combining) to make sure they end up together in one string and do not get rendered separately (which is also a requirement for grapheme clusters). Currently this is possible because combining chars always have a wcwidth of 0. In a grapheme cluster a following char might have a wcwidth != 0, and adding those up in the first cell will break the cursor movement. How to handle the cursor here is still unclear to me.
- How to deal with the sum of wcwidths? In a grapheme cluster the sum of the individual wcwidths is likely to be bigger than the wcwidth of the final user-perceived character (the chars typically get merged into something new). Here the grid-based monospace environment might create ugly space between grapheme clusters if we enforce grid alignment. A word processor with variable font width does not suffer from this, it can just align the stuff as needed. No clue yet what to do here.
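To make the two questions above concrete, here is a small sketch comparing the summed per-char widths of one cluster with the width of its first base char. Everything in it is an assumption for illustration: the width table is a toy (it treats a rough emoji range as wide and ZWJ as zero width), the cluster is handed in already segmented, and `firstBaseWidth` is just one conceivable policy, not a proposed answer.

```typescript
// Hypothetical per-code-point widths (assumption: rough emoji range is wide,
// ZWJ and combining diacritics are zero width, everything else is one cell).
function codePointWidth(cp: number): number {
  if (cp === 0x200d) return 0;                   // ZERO WIDTH JOINER
  if (cp >= 0x0300 && cp <= 0x036f) return 0;    // combining diacritics
  if (cp >= 0x1f300 && cp <= 0x1faff) return 2;  // rough emoji subset
  return 1;
}

// Widths of every code point in an already segmented grapheme cluster.
function clusterWidths(cluster: string): number[] {
  return Array.from(cluster, ch => codePointWidth(ch.codePointAt(0) ?? 0));
}

// What a plain wcwidth loop would advance the cursor by.
function summedWidth(cluster: string): number {
  let sum = 0;
  for (const w of clusterWidths(cluster)) sum += w;
  return sum;
}

// One conceivable policy: take the width of the first non-zero-width char.
function firstBaseWidth(cluster: string): number {
  for (const w of clusterWidths(cluster)) {
    if (w > 0) return w;
  }
  return 0;
}

// MAN + ZWJ + WOMAN + ZWJ + GIRL, perceived as a single "family" character.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
```

Under these toy widths `summedWidth(family)` is 6 while `firstBaseWidth(family)` is 2. If the grid reserves six cells, the single rendered glyph leaves the gap described in the second question; if all chars are stored in the first cell, the cursor has to know the cluster's width some other way, which is the first question.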
It seems I can't find a terminal that supports grapheme clusters yet; not sure if this was ever done for a grid-based monospace environment at all. Maybe some code editors have implemented this before. How about iTerm2?
Maybe someone could share some experience regarding this topic so we can avoid basic misconceptions and flaws from the beginning.
Issue Analytics
- Created 5 years ago
- Comments:15 (15 by maintainers)
Top GitHub Comments
@Tyriar Yup, we only need the codepoint ranges and the “character type” (those Control, … entries) from that file to have all the needed information. I ended up encoding the information into 256th codepoint chunks with a length and a type attribute (about 3 kB, or ~4 kB as base64). I would not pack it further with a zip algo, since 4 kB is almost negligible for local builds and browser-based builds tend to have a transport zipper anyway.
Ok, I will put it into a ts file. And yes, the lookup table creation can be postponed until the first chars fly in. It can even be split into 3 major Unicode ranges (one for lower than 12k, a 2nd for 42k - 65k and a 3rd for >65k), since the character type distribution differs a lot between them and each would get a slightly different lookup table layout.
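A minimal sketch of the "length plus type" idea with lazy table creation could look like this. It is an assumption-laden toy: the run data below is a tiny hand-written sample rather than the real Unicode data (CR/LF are folded into Control for brevity), the names `RUNS`, `buildTable` and `charType` are made up, and it uses a single flat table instead of the three ranges mentioned above.

```typescript
// Hypothetical character type ids, loosely following the property names.
const OTHER = 0;
const CONTROL = 1;
const EXTEND = 2;

// Toy run-length data: [run length, type] pairs starting at code point 0.
// Assumption: illustrative sample only, not the complete Unicode table.
const RUNS: ReadonlyArray<readonly [number, number]> = [
  [0x20, CONTROL], // U+0000..U+001F
  [0x5f, OTHER],   // U+0020..U+007E
  [0x21, CONTROL], // U+007F..U+009F
  [0x260, OTHER],  // U+00A0..U+02FF
  [0x70, EXTEND],  // U+0300..U+036F
];

let table: Uint8Array | undefined; // stays undefined until the first lookup
let buildCount = 0;                // only here to show the table is built once

// Expand the runs into a flat per-code-point table.
function buildTable(): Uint8Array {
  buildCount++;
  let total = 0;
  for (const [len] of RUNS) total += len;
  const expanded = new Uint8Array(total);
  let pos = 0;
  for (const [len, type] of RUNS) {
    expanded.fill(type, pos, pos + len);
    pos += len;
  }
  return expanded;
}

// Look up the type of a code point, building the table on first use.
function charType(cp: number): number {
  const t = table ?? (table = buildTable());
  // Code points beyond the toy data fall back to OTHER.
  return cp < t.length ? t[cp] : OTHER;
}
```

The compact run list is what would be shipped in the ts file; the expanded table only costs memory and startup time once the first char actually needs a lookup.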
@Tyriar Ah yes #701 is full of good refs to get something started. And Terminal.app does some magic, where is the code hosted again? Just kidding…
Edit: To keep things comprehensible I would not mess around with RTL for now.