UTF-8 isn't encoded/decoded correctly

msgpack.encode('🐦')
  • Produces: [166, 237, 160, 189, 237, 176, 166]
  • Expected: [164, 240, 159, 144, 166]

The UTF-16 surrogate pair is incorrectly encoded as two 3-byte sequences (one per surrogate) instead of a single 4-byte UTF-8 sequence for the code point U+1F426.

It seems intentional, given this comment (buffer-lite.js:21):

// JavaScript's string uses UTF-16 surrogate pairs for characters other than BMP.
// This encodes string as CESU-8 which never reaches 4 octets per character.
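For contrast with the CESU-8 behavior that comment describes, here is a minimal sketch of surrogate-aware UTF-8 encoding. `utf8Bytes` is a hypothetical helper written for illustration only; it is not the msgpack-lite code or the eventual fix, and it assumes lone surrogates may simply pass through as 3-byte sequences.

```javascript
// Hypothetical helper (not part of msgpack-lite): encode a JavaScript string
// as UTF-8 bytes, merging each surrogate pair into one code point first.
function utf8Bytes(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++; // the low surrogate has been consumed
      }
    }
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // Assumption: an unpaired surrogate lands here and is emitted as-is;
      // a strict encoder would substitute U+FFFD instead.
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return out;
}

console.log(utf8Bytes('🐦')); // [ 240, 159, 144, 166 ]
```

Prefixed with the fixstr header byte for a 4-byte string (0xa0 | 4 = 164), this gives the expected output listed above.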

I don’t see any provision in the msgpack spec for encoding strings as CESU-8 instead of UTF-8, though. At best this will cause interoperability issues with other msgpack implementations; at worst, they will crash on code points they cannot decode.
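As a rough illustration of the interoperability concern, the sketch below feeds both byte sequences (minus the msgpack header byte) to a strict UTF-8 decoder. It assumes a runtime that provides the standard `TextDecoder` with the `fatal` option, such as a recent Node.js or a browser; `tryStrictDecode` is a hypothetical name.

```javascript
// String payloads from the issue, without the leading msgpack fixstr byte.
const cesu8Payload = Uint8Array.from([237, 160, 189, 237, 176, 166]);
const utf8Payload = Uint8Array.from([240, 159, 144, 166]);

// Hypothetical helper: decode as strict UTF-8, returning null on invalid input.
function tryStrictDecode(bytes) {
  try {
    // fatal: true makes the decoder throw instead of emitting U+FFFD.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    return null;
  }
}

console.log(tryStrictDecode(utf8Payload) === '🐦'); // true
console.log(tryStrictDecode(cesu8Payload));          // null
```

A lenient decoder would not throw on the CESU-8 payload, but it would produce replacement characters rather than the original string.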

I’ve written a plain JavaScript UTF-8 implementation before and will make a PR when I get a moment.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

3 reactions
kawanet commented, Oct 14, 2016

0.1.26 published. Thanks!

0 reactions
gordomium commented, Sep 8, 2016

Hope it gets fixed.
