Replace emoji naming system using Unicode CLDR

The Unicode CLDR project provides a number of comprehensively-researched and well-maintained data sources for a variety of language-related topics, including names and keywords for emoji. Although our own list has served reasonably well, we don’t have the resources of the Unicode Consortium, so ours is comparatively incomplete, sporadically maintained, and English-only. I think we should aim to replace it using the CLDR.

The CLDR data model is slightly different, with more verbose canonical names (which is probably an inevitable consequence of covering more emoji), and keywords rather than aliases. This might require some UI adjustments. I propose the following:

Canonically store emoji (except custom emoji) as Unicode characters 👻 rather than colon-strings :ghost:.
Generate emoji tooltips using the CLDR canonical name in the viewer’s chosen language.
Filter using both CLDR canonical names and CLDR keywords in the emoji picker search bar and the emoji typeahead widget.
Maybe allow users to define their own emoji aliases, for cases where their favorite emoji might be unergonomic to find using the picker.
[@gnprice adds:] Map emoji from Matrix and Slack imports based on data about the emoji names used by Matrix and Slack respectively, rather than by using our own emoji names and half-heartedly making our names align with theirs. See below at https://github.com/zulip/zulip/issues/18121#issuecomment-1168026900 .
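
As a rough illustration of the first and third points, here is a minimal Python sketch of a picker search that stores each emoji as its Unicode character and matches a query against both a canonical name and a keyword list. The `EmojiEntry` type, the `search_emoji` function, and the sample data are all hypothetical; the names and keywords are only shaped like CLDR annotations, not copied from CLDR, and this is not Zulip's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EmojiEntry:
    """Hypothetical record for one emoji, keyed by the character itself."""

    char: str  # canonical storage form: the Unicode character, not ":ghost:"
    name: str  # CLDR-style canonical name, usable as tooltip text
    keywords: Tuple[str, ...]  # CLDR-style keywords; may be shared across emoji


# Illustrative sample data only; not copied verbatim from CLDR.
SAMPLE_EMOJI = [
    EmojiEntry("👻", "ghost", ("creature", "fairy tale", "monster")),
    EmojiEntry("👽", "alien", ("creature", "space", "ufo")),
    EmojiEntry("🎃", "jack-o-lantern", ("halloween", "pumpkin")),
]


def search_emoji(query: str, entries: List[EmojiEntry] = SAMPLE_EMOJI) -> List[str]:
    """Return matching characters, with name matches ahead of keyword-only matches."""
    q = query.strip().casefold()
    if not q:
        return []
    name_hits = [e.char for e in entries if q in e.name.casefold()]
    keyword_hits = [
        e.char
        for e in entries
        if e.char not in name_hits and any(q in k.casefold() for k in e.keywords)
    ]
    return name_hits + keyword_hits
```

Ranking name matches first is one possible design choice: it keeps an exact name such as "ghost" at the top while still letting a shared keyword such as "creature" surface several emoji.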

Issue Analytics

Created 2 years ago
Comments: 9 (9 by maintainers)

Top GitHub Comments

3 reactions

andersk commented, Apr 26, 2022

Part of what I’ve proposed above is that aliases will not be part of the data model.

Canonically store emoji (except custom emoji) as Unicode characters 👻 rather than colon-strings :ghost:.

My claim is that whatever value might exist in a system that communicates alias choices along with emoji is outweighed by its unmaintainability.

Other systems don’t treat emoji that way. An emoji is the union of all of its meanings, just like a word is the union of all of its meanings; everybody in a conversation understands that, so readers are responsible for using context to distinguish meanings and writers are responsible for providing enough context. If we insist on keeping our unique model, we’ll be forever fighting with other systems whenever we exchange data with them.

Much of the value in communicable alias choices is as a workaround for bad primary names. With CLDR, the primary names are very good; the CLDR has put a lot more effort into getting them right than we ever could.

We don’t even fully support our own alias model today. You can’t view a non-primary alias from a device without mouse hover support. You can’t react with a non-primary alias from the web. You can’t react with one alias of emoji if there’s already a reaction with a different alias of that emoji. If you type a Unicode character from your device’s native emoji picker, we treat it as if you used the primary name rather than remembering that you didn’t specify one.

And even in the best case where an alias choice is successfully communicated through a fully supported path, it’s treated as a glorified easter egg. Let’s play a puzzle game! Which meaning did I intend for this character? Oh look, you can trigger the hidden tooltip to reveal it! Isn’t this so clever and exciting?

Unicode now has 3633 emoji and growing, up from the 1051 that Zulip currently supports. We don’t have the resources to curate the thoughtful sets of names and aliases for all of them that would be needed for the current model. And we can’t pool resources with other projects because other systems don’t treat emoji that way. Is maintaining our own emojiverse for a glorified easter egg really what we want to be doing with our time?

1 reaction

timabbott commented, Apr 26, 2022

Looking at using CLDR, one problem is that it does not distinguish between things that are potential aliases for the name and things that are potential search keywords for the item. Here’s a section of body parts:

Notice how “body”, “accessibility”, “prosthetic”, etc. each appear with multiple emoji. Those terms are obviously not suitable as alternative names for the emoji: in our data model, at least, each emoji name should map to a unique codepoint (though a given codepoint can have multiple names/aliases mapping to it). I’m pretty sure that matches how users expect to see this in the UI.

Given that detail, I think we could easily use CLDR as part of a revised algorithm for searching for emoji, and possibly for adjusting the primary names we use, but not as a direct source of aliases. It’s possible, though, that one could process CLDR, turning duplicates into search keywords, and still end up with a usable set of aliases.
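
The processing step mentioned in the last sentence could look something like the following Python sketch. It assumes the keywords have already been loaded into a dict from emoji character to keyword list; `split_aliases` and the sample data are hypothetical, and the keywords shown are illustrative rather than verbatim CLDR entries. A keyword attached to exactly one emoji becomes a candidate alias, and a keyword shared by several emoji is kept only for search.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Illustrative sample shaped like CLDR annotations; not verbatim CLDR data.
SAMPLE_KEYWORDS = {
    "🦾": ["accessibility", "mechanical arm", "prosthetic"],
    "🦿": ["accessibility", "mechanical leg", "prosthetic"],
    "🦻": ["accessibility", "ear with hearing aid", "hard of hearing"],
}


def split_aliases(
    keywords_by_emoji: Dict[str, List[str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split keywords into unique candidate aliases and shared search-only terms."""
    # Invert the mapping: for each keyword, collect every emoji that carries it.
    owners = defaultdict(set)
    for char, keywords in keywords_by_emoji.items():
        for keyword in keywords:
            owners[keyword].add(char)

    # A keyword owned by exactly one emoji can safely name that emoji.
    aliases = {
        keyword: next(iter(chars))
        for keyword, chars in owners.items()
        if len(chars) == 1
    }
    # A keyword owned by several emoji is ambiguous, so it is only a search term.
    search_only = {
        keyword: sorted(chars)
        for keyword, chars in owners.items()
        if len(chars) > 1
    }
    return aliases, search_only
```

With the sample above, "mechanical arm" ends up as a candidate alias for 🦾, while "accessibility" and "prosthetic" are classified as search-only. Whether the resulting aliases are good enough to show users would still need a human review.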
