Replace emoji naming system using Unicode CLDR
The Unicode CLDR project provides a number of comprehensively-researched and well-maintained data sources for a variety of language-related topics, including names and keywords for emoji. Although our own list has served reasonably well, we don’t have the resources of the Unicode Consortium, so ours is comparatively incomplete, sporadically maintained, and English-only. I think we should aim to replace it using the CLDR.
The CLDR data model is slightly different, with more verbose canonical names (which is probably an inevitable consequence of covering more emoji), and keywords rather than aliases. This might require some UI adjustments. I propose the following:
- Canonically store emoji (except custom emoji) as Unicode characters 👻 rather than colon-strings :ghost:.
- Generate emoji tooltips using the CLDR canonical name in the viewer’s chosen language.
- Filter using both CLDR canonical names and CLDR keywords in the emoji picker search bar and the emoji typeahead widget.
- Maybe allow users to define their own emoji aliases, for cases where their favorite emoji might be unergonomic to find using the picker.
- [@gnprice adds:] Map emoji from Matrix and Slack imports based on data about the emoji names used by Matrix and Slack respectively, rather than by using our own emoji names and half-heartedly making our names align with theirs. See below at https://github.com/zulip/zulip/issues/18121#issuecomment-1168026900 .
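To make the storage, tooltip, and search items above concrete, here is a minimal sketch in Python. It assumes per-locale annotation data has already been loaded into a dictionary keyed by the Unicode character; the `EmojiAnnotation` type, the `ANNOTATIONS` table, and the `tooltip` and `search` helpers are all hypothetical names, and the sample entries are hand-written for illustration rather than copied from CLDR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmojiAnnotation:
    name: str                  # canonical name in one locale
    keywords: tuple[str, ...]  # search keywords in the same locale


# Hypothetical sample data: locale -> emoji character -> annotation.
# The emoji itself (not a colon-string) is the key.
ANNOTATIONS: dict[str, dict[str, EmojiAnnotation]] = {
    "en": {
        "👻": EmojiAnnotation("ghost", ("creature", "halloween", "spooky")),
        "💀": EmojiAnnotation("skull", ("death", "halloween", "skeleton")),
    },
    "fr": {
        "👻": EmojiAnnotation("fantôme", ("créature", "halloween")),
    },
}


def tooltip(emoji: str, locale: str, fallback: str = "en") -> str:
    """Return the canonical name in the viewer's locale, else the fallback."""
    table = ANNOTATIONS.get(locale, {})
    if emoji in table:
        return table[emoji].name
    return ANNOTATIONS[fallback][emoji].name


def search(query: str, locale: str) -> list[str]:
    """Match the query against names first, then keywords."""
    q = query.casefold()
    name_hits: list[str] = []
    keyword_hits: list[str] = []
    for emoji, ann in ANNOTATIONS.get(locale, {}).items():
        if q in ann.name.casefold():
            name_hits.append(emoji)
        elif any(q in kw.casefold() for kw in ann.keywords):
            keyword_hits.append(emoji)
    # Name matches rank ahead of keyword-only matches.
    return name_hits + keyword_hits
```

Ranking name matches ahead of keyword-only matches is just one possible choice here; the proposal itself only says that both should be searched.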
Issue Analytics
- State:
- Created 2 years ago
- Comments:9 (9 by maintainers)

Part of what I’ve proposed above is that aliases will not be part of the data model.
My claim is that whatever value might exist in a system that communicates alias choices along with emoji is outweighed by its unmaintainability.
Other systems don’t treat emoji that way. An emoji is the union of all of its meanings, just like a word is the union of all of its meanings; everybody in a conversation understands that, so readers are responsible for using context to distinguish meanings and writers are responsible for providing enough context. If we insist on keeping our unique model, we’ll be forever fighting with other systems whenever we exchange data with them.
Much of the value in communicable alias choices is as a workaround for bad primary names. With CLDR, the primary names are very good; the CLDR has put a lot more effort into getting them right than we ever could.
We don’t even fully support our own alias model today. You can’t view a non-primary alias from a device without mouse hover support. You can’t react with a non-primary alias from the web. You can’t react with one alias of an emoji if there’s already a reaction with a different alias of that emoji. If you type a Unicode character from your device’s native emoji picker, we treat it as if you used the primary name rather than remembering that you didn’t specify one.
And even in the best case where an alias choice is successfully communicated through a fully supported path, it’s treated as a glorified easter egg. Let’s play a puzzle game! Which meaning did I intend for this character? Oh look, you can trigger the hidden tooltip to reveal it! Isn’t this so clever and exciting?
Unicode now has 3633 emoji and counting, compared with the 1051 that Zulip currently supports. We don’t have the resources to curate the thoughtful sets of names and aliases for all of them that would be needed for the current model. And we can’t pool resources with other projects because other systems don’t treat emoji that way. Is maintaining our own emojiverse for a glorified easter egg really what we want to be doing with our time?
Looking at using CLDR, one problem is that it does not distinguish between things that are potential aliases for the name and things that are potential search keywords for the item. Here’s a section of body parts:
Notice how “body”, “accessibility”, “prosthetic”, etc. each appear with multiple emoji. Those terms are obviously not suitable as alternative names for the emoji: in our data model at least, each emoji name should map to a unique codepoint (though a given codepoint can have multiple names/aliases mapping to it). I’m pretty sure that matches how users expect to see this in the UI.
Given that detail, I think we could easily use CLDR as part of a revised algorithm for searching for emoji, and possibly for adjusting the primary names we use, but not as a direct source of aliases. Though it’s possible that one could process CLDR, turning duplicates into search keywords, and still end up with a usable set of aliases.
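A rough sketch of that processing step, in Python: any keyword attached to more than one emoji is demoted to a search-only keyword, and keywords unique to a single emoji are kept as alias candidates. The `KEYWORDS` sample and the `split_keywords` helper are hypothetical; the entries are loosely modeled on the body-parts example and may not match the real CLDR data.

```python
from collections import Counter

# Hypothetical sample: emoji -> keywords (illustrative, not actual CLDR data).
KEYWORDS: dict[str, list[str]] = {
    "🦾": ["accessibility", "mechanical arm", "prosthetic"],
    "🦿": ["accessibility", "mechanical leg", "prosthetic"],
    "🦵": ["body", "kick", "leg", "limb"],
    "🦶": ["body", "foot", "kick", "stomp"],
}


def split_keywords(
    keywords: dict[str, list[str]],
) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Split each emoji's keywords into alias candidates and search-only terms."""
    # Count how many distinct emoji carry each keyword.
    counts = Counter(kw for kws in keywords.values() for kw in set(kws))
    aliases: dict[str, list[str]] = {}
    search_only: dict[str, list[str]] = {}
    for emoji, kws in keywords.items():
        # A keyword used by exactly one emoji can name it unambiguously.
        aliases[emoji] = [kw for kw in kws if counts[kw] == 1]
        # Shared keywords stay useful for search but cannot be names.
        search_only[emoji] = [kw for kw in kws if counts[kw] > 1]
    return aliases, search_only
```

On this sample, “leg” and “limb” survive as alias candidates for 🦵, while “body” and “kick” become search-only. A real pass would also need to check alias candidates against the canonical names of other emoji, which this sketch does not do.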