Consider UTF-8?

We are using UTF-16, which has some cross-platform compatibility issues, starting with a different type being used on different platforms: wchar_t on Windows and char16_t everywhere else, with the former being a fundamental type (not requiring an include) and the latter defined as an unsigned short. This is thanks to 'nix toolchains picking UTF-32 (which is excessive for most uses) as the default Unicode type.
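To see the platform difference described above, here is a minimal sketch in Python (used purely for illustration, it is not part of the project). It asks ctypes for the size of the platform's wchar_t and compares it with a UTF-16 code unit. On typical systems this reports 2 bytes for wchar_t on Windows and 4 bytes on Linux and macOS.

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as seen through ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# A UTF-16 code unit is always 2 bytes, regardless of platform.
utf16_unit_size = len("A".encode("utf-16-le"))

print(f"wchar_t: {wchar_size} bytes, UTF-16 code unit: {utf16_unit_size} bytes")
```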

UTF-8 is more cross-platform and has been supported in Windows since Win10. For prior versions of Windows, it should be possible to convert to UTF-16 for things that require it. A quick search shows that V8 might use UTF-8.
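The "convert to UTF-16 for things that require it" idea could look roughly like the sketch below, written in Python for brevity. The helper names are hypothetical and only illustrate re-encoding at an API boundary. A real engine would do this in native code.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16LE (no BOM), e.g. before calling a wide-character API."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16LE bytes (no BOM) as UTF-8, e.g. for results coming back from that API."""
    return data.decode("utf-16-le").encode("utf-8")


original = "a€😀".encode("utf-8")
wide = utf8_to_utf16le(original)
print(len(original), len(wide))  # byte counts before and after conversion
```

The round trip is lossless for valid input, so the internal representation can stay UTF-8 and only the boundary pays the conversion cost.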

I am not sure whether or not UTF-8 would be easier for our text-related projects, like RegEx support. @rhuanjl do you have any thoughts?

Issue Analytics

Created 3 years ago
Comments: 12 (6 by maintainers)

Top GitHub Comments

rhuanjl commented, Jan 11, 2021

Moving to using UTF-8 internally would significantly reduce memory usage in many scenarios, just because you end up saving 50% of memory usage for Basic Latin characters, which work out to 99.99% of all strings in JS.

IMO this is a really strong argument for it
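As a rough illustration of the memory argument, this Python sketch compares encoded sizes for a few sample strings. The function name and the samples are made up for the example.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ("function", "日本語", "😀"):
    utf8_size, utf16_size = encoded_sizes(sample)
    print(f"{sample!r}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

ASCII text halves in size under UTF-8, characters in the range U+0800 to U+FFFF (such as most CJK text) grow from 2 to 3 bytes each, and characters above U+FFFF take 4 bytes in both encodings.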

The key challenge, though, is how to ensure that the observable behaviour matches UTF-16 for the points defined in the spec.
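One example of such observable behaviour is string length, which JS reports in UTF-16 code units (so "😀".length is 2). The sketch below, in Python with a hypothetical helper name, shows how that count could be derived from UTF-8 storage. It assumes the input is valid UTF-8. JS strings may also contain unpaired surrogates, which strict UTF-8 cannot encode, and this sketch does not address that case.

```python
def utf16_length(utf8: bytes) -> int:
    """Count the UTF-16 code units that valid UTF-8 input would occupy."""
    units = 0
    for byte in utf8:
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: part of a character already counted
        # A lead byte of 0xF0 or above starts a 4-byte sequence, i.e. a code
        # point above U+FFFF, which UTF-16 stores as a surrogate pair.
        units += 2 if byte >= 0xF0 else 1
    return units


print(utf16_length("a€😀".encode("utf-8")))
```

Computing the length this way is linear in the byte count, so an engine would probably want to cache it or keep an index. That is a design assumption rather than something stated in the thread.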

One option (if memory efficiency is the aim) would be to stick to 8-bit chars for ASCII-only strings and use UTF-16 strings whenever non-ASCII characters are needed, though this would be a complex change AND wouldn’t address the initial motivation of this discussion (getting rid of the wchar type).
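A minimal sketch of that hybrid idea, in Python and with hypothetical names (choose_storage and the "ascii8"/"utf16" tags are illustrative only), might pick a representation per string like this:

```python
from typing import Tuple


def choose_storage(text: str) -> Tuple[str, bytes]:
    """Pick a per-string representation: 1 byte per char for ASCII, else UTF-16LE."""
    if text.isascii():
        return "ascii8", text.encode("ascii")
    return "utf16", text.encode("utf-16-le")


for sample in ("let x = 1;", "café"):
    kind, payload = choose_storage(sample)
    print(f"{sample!r}: stored as {kind}, {len(payload)} bytes")
```

Every string operation would then need to handle both representations, which is where the complexity mentioned above comes from.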

ljharb commented, Jan 12, 2021

Also, I’m very skeptical that the very high usage of emoji in vernacular discussion hasn’t leaked into JS strings.
