We are using UTF-16, which has some cross-platform compatibility issues, starting with the different types used on different platforms: wchar_t on Windows and char16_t everywhere else. The former is a fundamental type (not requiring an include), while the latter is defined as an unsigned short. This is because 'nix toolchains picked UTF-32 (which is excessive for most uses) as the default Unicode type, so wchar_t is 32 bits wide there.
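To see the type-width difference on a given machine, here is a minimal sketch in Python (not the project's own code). It uses `ctypes` to report the size of the platform's C `wchar_t`, and it uses `c_uint16` as a stand-in for a 16-bit code unit type; that stand-in is an assumption made for illustration.

```python
import ctypes

# Size of the platform's C wchar_t: typically 2 bytes on Windows,
# 4 bytes on Linux and macOS toolchains.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

# c_uint16 stands in for a 16-bit code unit type (an assumption for
# illustration; it is not the project's actual typedef).
char16_bytes = ctypes.sizeof(ctypes.c_uint16)

print(f"wchar_t: {wchar_bytes} bytes, 16-bit code unit: {char16_bytes} bytes")
```

If the two sizes differ, wchar_t cannot be used directly as a UTF-16 code unit on that platform.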

UTF-8 is more cross-platform and has been supported on Windows since Windows 10. For earlier versions of Windows, it should be possible to convert to UTF-16 for the things that require it. A quick search suggests that V8 might use UTF-8.
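As a rough illustration of converting at the boundary, the sketch below turns UTF-8 bytes into a list of UTF-16 code units. It is written in Python for brevity and the function name `utf8_to_utf16_units` is hypothetical; a real engine would do this in native code.

```python
from typing import List


def utf8_to_utf16_units(data: bytes) -> List[int]:
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units."""
    text = data.decode("utf-8")
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    # Group the encoded bytes into 16-bit code units.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


latin_units = utf8_to_utf16_units("héllo".encode("utf-8"))
emoji_units = utf8_to_utf16_units("😀".encode("utf-8"))
print([hex(u) for u in latin_units])
print([hex(u) for u in emoji_units])
```

Characters outside the Basic Multilingual Plane, such as the emoji here, come out as a surrogate pair (two code units).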

I am not sure whether UTF-8 would be easier for our text-related projects, like RegEx support. @rhuanjl do you have any thoughts?

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
rhuanjl commented, Jan 11, 2021

Moving to UTF-8 internally would significantly reduce memory usage in many scenarios, because you save 50% of the memory for basic Latin characters, which work out to 99.99% of all strings in JS.

IMO this is a really strong argument for it
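The size difference is easy to check with a small sketch (Python used for brevity; `encoded_sizes` is a hypothetical helper, and the sample strings are made up for illustration). It also shows the flip side: text such as CJK takes more bytes in UTF-8 than in UTF-16.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Made-up sample strings: source-like ASCII, accented Latin, and CJK.
samples = ["function add(a, b) { return a + b; }", "naïve café", "日本語"]
for sample in samples:
    utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{sample!r}: UTF-8 = {utf8_bytes} bytes, UTF-16 = {utf16_bytes} bytes")
```

For pure ASCII input the UTF-8 size is exactly half the UTF-16 size, which is where the 50% figure comes from.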

How to ensure that the observable behaviour matches UTF-16 at the points defined in the spec is the key challenge, though.
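One example of such a point is string length, which JS reports in UTF-16 code units. A minimal sketch of computing that from UTF-8 storage follows. It assumes the input is valid UTF-8, and `utf16_length_of_utf8` is a hypothetical name, not an existing API.

```python
def utf16_length_of_utf8(data: bytes) -> int:
    """Count the UTF-16 code units needed for valid UTF-8 input, without decoding."""
    units = 0
    for byte in data:
        if byte & 0xC0 == 0x80:
            # Continuation byte (10xxxxxx): already counted via its lead byte.
            continue
        # A lead byte of a 4-byte sequence (0xF0 and above) maps to a
        # surrogate pair, so it counts as two code units; all others count as one.
        units += 2 if byte >= 0xF0 else 1
    return units


print(utf16_length_of_utf8("a👍".encode("utf-8")))
```

Note that this is a linear scan, whereas length on a UTF-16 buffer is a constant-time lookup; indexing by code unit has the same problem, which is part of why matching the spec on top of UTF-8 is hard.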

One option (if memory efficiency is the aim) would be to stick to 8-bit chars for ASCII characters and use UTF-16 strings whenever non-ASCII was needed, though this would be a complex change AND wouldn’t target the initial motivation of this discussion (getting rid of the wchar type).

0 reactions
ljharb commented, Jan 12, 2021
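A rough sketch of that dual-representation idea, in Python for brevity: the class `HybridString` and its method names are hypothetical and only illustrate the storage choice, not how an engine would actually implement it.

```python
class HybridString:
    """Hypothetical string that stores ASCII text as 1 byte per char, else UTF-16."""

    def __init__(self, text: str):
        self.is_ascii = text.isascii()  # requires Python 3.7+
        self.data = text.encode("ascii" if self.is_ascii else "utf-16-le")

    def __len__(self) -> int:
        # Length is always reported in UTF-16 code units.
        return len(self.data) if self.is_ascii else len(self.data) // 2

    def char_code_at(self, index: int) -> int:
        """Return the UTF-16 code unit at the given index."""
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        if self.is_ascii:
            return self.data[index]  # an ASCII byte equals its UTF-16 code unit
        return int.from_bytes(self.data[2 * index:2 * index + 2], "little")


ascii_str = HybridString("abc")
mixed_str = HybridString("a👍")
print(len(ascii_str.data), len(mixed_str.data))
```

Both representations answer length and indexing in UTF-16 code units, so observable behaviour stays the same; the cost is that every string operation needs a path for each representation.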

Also, I’m very skeptical that the very high usage of emoji in vernacular discussion hasn’t leaked into JS strings.


