Grapheme support

I think grapheme support needs to be discussed before going down that road.

Grapheme clusters are the Unicode way of representing user-perceived characters that are built from several consecutive Unicode code points in a string. Some writing systems need them to draw the correct glyph: drawing the code points independently produces a different glyph than drawing them together.

The problem in a terminal environment: the terminal typically has a grid-based layout in which a single fixed-size cell represents a character and/or a possible cursor position. In a monospaced ASCII world this is perfect: every printable char moves the cursor by one cell and can be rendered at that position. It is not that easy once Unicode enters the stage. The first problems are fullwidth characters, used by many Asian languages, and combining characters, used in many writing systems. xterm.js currently handles these with a typical wcwidth implementation and the cell-based model, about as well (or as badly) as most other terminal emulators.
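To make the cell model concrete, here is a rough sketch of how a wcwidth-style function maps a string onto grid cells. Both `simpleWidth` and `layoutCells` are hypothetical names made up for illustration, and the width ranges are a tiny, simplified excerpt, not a full wcwidth and not the actual xterm.js code.

```typescript
// Hypothetical, heavily simplified width function (assumption: NOT a full wcwidth).
function simpleWidth(cp: number): 0 | 1 | 2 {
  if (cp >= 0x0300 && cp <= 0x036f) return 0; // combining diacritical marks
  if (
    (cp >= 0x1100 && cp <= 0x115f) || // Hangul Jamo leading consonants
    (cp >= 0x2e80 && cp <= 0xa4cf) || // CJK radicals through Yi (coarse range)
    (cp >= 0xac00 && cp <= 0xd7a3) || // Hangul syllables
    (cp >= 0xf900 && cp <= 0xfaff) || // CJK compatibility ideographs
    (cp >= 0xff00 && cp <= 0xff60) // fullwidth forms
  ) {
    return 2;
  }
  return 1;
}

// null marks the second half of a fullwidth char.
type Cell = string | null;

function layoutCells(text: string): Cell[] {
  const cells: Cell[] = [];
  let last = -1; // index of the cell holding the most recent base char
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const w = simpleWidth(ch.codePointAt(0)!);
    if (w === 0 && last >= 0) {
      // Zero-width char: append to the previous base cell, cursor does not move.
      cells[last] = (cells[last] ?? "") + ch;
      continue;
    }
    last = cells.length;
    cells.push(ch);
    if (w === 2) cells.push(null); // fullwidth char occupies a second cell
  }
  return cells;
}

console.log(layoutCells("a\u4e2db")); // 4 cells: "a", U+4E2D, null, "b"
console.log(layoutCells("e\u0301x")); // 2 cells: "e" + U+0301, "x"
```

Appending only works here because the appended char has width 0; that is exactly the assumption grapheme clusters break.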

Grapheme clusters are the next level of this problem: they typically join multiple cells that wcwidth plus the combining char handling would output, so that multiple combined cells form ONE perceived character. Compared to the current fullwidth char handling:

[..., fullwidth_char, null, ...]  --> 2 cells for fullwidth char

and the current combining handling:

[..., char+combining, ...] --> 1 cell for char with combining

and the current fullwidth + combining handling:

[..., fullwidth_char+combining, null, ...]  --> 2 cells for fullwidth char with combining

it is not that easy with grapheme support anymore. Clusters can be built from any combination of the above (limited by the grapheme breaking algorithm, of course), which raises several questions:

Where to add the chars up? The current combining handling appends modifier characters to the first cell (char+combining) to make sure they end up together in one string and are not rendered separately (which is also a requirement for grapheme clusters). Currently this is possible because combining chars always have a wcwidth of 0. In a grapheme cluster a following char might have a wcwidth != 0, and adding those up in the first cell will break the cursor movement. How to handle the cursor here is still unclear to me.
How to deal with the sum of wcwidths? In a grapheme cluster the sum of the individual wcwidths is likely to be bigger than the wcwidth of the final user-perceived character (they typically get merged into something new). Here the grid-based monospace environment might create ugly gaps between grapheme clusters if we enforce grid alignment. A word processor with variable font width does not suffer from this; it can just align things as needed. No clue yet what to do here.
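The width mismatch in the second question can be shown with a small sketch. It assumes a runtime that provides `Intl.Segmenter` (recent Node.js and browsers) to find the cluster boundaries; `approxWidth` and `describeClusters` are hypothetical helpers with a deliberately tiny width table, so take the numbers as an illustration rather than real wcwidth output.

```typescript
// Hypothetical, simplified per-code-point width (assumption: tiny excerpt only).
function approxWidth(cp: number): number {
  if (cp === 0x200d) return 0; // zero width joiner
  if (cp >= 0x0300 && cp <= 0x036f) return 0; // combining diacritical marks
  if (cp >= 0xfe00 && cp <= 0xfe0f) return 0; // variation selectors
  if (cp >= 0x4e00 && cp <= 0x9fff) return 2; // CJK unified ideographs
  if (cp >= 0x1f300 && cp <= 0x1faff) return 2; // emoji blocks (coarse range)
  return 1;
}

interface ClusterInfo {
  cluster: string; // one user-perceived character
  summedWidth: number; // naive sum of the per-code-point widths
}

function describeClusters(text: string): ClusterInfo[] {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), ({ segment }) => {
    let summedWidth = 0;
    for (const ch of segment) summedWidth += approxWidth(ch.codePointAt(0)!);
    return { cluster: segment, summedWidth };
  });
}

// man + ZWJ + woman + ZWJ + girl: five code points, one perceived character.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(describeClusters("a" + family + "b").map((c) => c.summedWidth));
// prints [ 1, 6, 1 ]
```

The family emoji is usually drawn as a single glyph roughly two cells wide, yet the naive sum reserves six cells. That gap is where the ugly spacing would come from if grid alignment is enforced.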

I can't seem to find a terminal that supports grapheme clusters yet, and I am not sure whether this has been done before for a grid-based monospace environment at all. Maybe some code editors have implemented this before. How about iTerm2?

Maybe someone could share some experience with this topic so we can avoid basic misconceptions and flaws from the beginning.

Issue Analytics

State:
Created 5 years ago
Comments:15 (15 by maintainers)

Top GitHub Comments

1 reaction

jerch commented, May 29, 2018

@Tyriar Yup, we only need the codepoint ranges and the “character type” (those Control, … entries) from that file to have all the needed information. I ended up encoding the information into 256-codepoint chunks with a length and a type attribute (about 3 kB, or ~4 kB as base64). I would not pack it further with a zip algorithm, since 4 kB is almost negligible for local builds and browser-based builds tend to have transport compression anyway.

I don’t think it’s typical to import json files in TS projects, so we could pull it into a TS file and export what we want. Also should the lookup table be lazily initialized somehow?

OK, I will put it into a TS file. And yes, the lookup table creation can be postponed until the first chars fly in. It can even be split into 3 major Unicode ranges (one below 12k, a second for 42k - 65k and a third for >65k), since the character type distribution differs a lot between them, each with a slightly different lookup table layout.
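For anyone following along, a lazily built lookup table could look roughly like the sketch below. This is not the encoding described above: the run format, the 256-entry chunk layout and all names (`RUNS`, `charType`, `isTableBuilt`) are assumptions for illustration, and the data is a short excerpt of ranges from Unicode's GraphemeBreakProperty.txt.

```typescript
// Character types; names follow GraphemeBreakProperty.txt, numbering is arbitrary.
const OTHER = 0;
const CONTROL = 1;
const EXTEND = 2;
const ZWJ = 3;
const REGIONAL_INDICATOR = 4;

// Illustrative excerpt only: [first code point, run length, type].
const RUNS: ReadonlyArray<readonly [number, number, number]> = [
  [0x0000, 10, CONTROL], // U+0000..U+0009
  [0x0300, 112, EXTEND], // U+0300..U+036F
  [0x200d, 1, ZWJ], // U+200D
  [0x1f1e6, 26, REGIONAL_INDICATOR], // U+1F1E6..U+1F1FF
];

// Sparse table: one 256-entry chunk per (codePoint >> 8), allocated only
// for chunks that contain at least one non-OTHER entry.
let chunks: Map<number, Uint8Array> | undefined;

function buildChunks(): Map<number, Uint8Array> {
  const result = new Map<number, Uint8Array>();
  for (const [start, length, type] of RUNS) {
    for (let cp = start; cp < start + length; cp++) {
      const key = cp >> 8;
      let chunk = result.get(key);
      if (!chunk) {
        chunk = new Uint8Array(256); // zero-filled, i.e. all OTHER
        result.set(key, chunk);
      }
      chunk[cp & 0xff] = type;
    }
  }
  return result;
}

// Reports whether the table exists yet (for demonstration only).
function isTableBuilt(): boolean {
  return chunks !== undefined;
}

// The table is created on the first lookup, not at import time.
function charType(cp: number): number {
  if (!chunks) chunks = buildChunks();
  const chunk = chunks.get(cp >> 8);
  return chunk ? chunk[cp & 0xff] : OTHER;
}
```

Nothing is allocated until `charType` is first called, so a session that never prints anything pays no setup cost.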

1 reaction

jerch commented, May 23, 2018

@Tyriar Ah yes #701 is full of good refs to get something started. And Terminal.app does some magic, where is the code hosted again? Just kidding…

Edit: To keep things comprehensible I would not mess around with RTL for now.
