Cannot resolve mojibake on emoji

We’re getting JSON from a third party

{
  "title": "\\u011f\\u0178\\ufffd\\u0161 Cooked Rice"
}

This is supposed to correspond to the Cooked Rice emoji: 🍚 https://www.compart.com/en/unicode/U+1F35A

They also send us

{
  "first_name": "Ya\\u00c3\\u00abl"
}

This one can be cleaned up simply by round-tripping through latin1:

print('Ya\u00c3\u00abl')
YaÃ«l
print('Ya\u00c3\u00abl'.encode('latin1').decode('utf-8'))
Yaël

Any help here would be amazing.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

rspeer commented, Feb 18, 2021

Unfortunately, ftfy isn’t going to be able to automatically fix either of these.

The second example is too short to be reliably detected. If you know that latin1 / utf-8 mixups are likely, you should probably just try that encoding and decoding yourself.
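Trying the latin1 / utf-8 round trip by hand could look like the sketch below. The helper name `fix_latin1_mojibake` is hypothetical and not part of ftfy; it assumes that a string which fails the round trip should be returned unchanged.

```python
def fix_latin1_mojibake(text):
    """Undo a latin1 / utf-8 mixup if the round trip succeeds.

    Hypothetical helper, not an ftfy function. If the text cannot be
    encoded as latin1, or the resulting bytes are not valid UTF-8, the
    text was probably not this kind of mojibake, so it is returned as-is.
    """
    try:
        return text.encode("latin1").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


print(fix_latin1_mojibake("Ya\u00c3\u00abl"))
```

Text that is already correct, such as 'Yaël', fails the UTF-8 decoding step and passes through untouched. Pure ASCII text survives the round trip unchanged.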

The first one makes a fascinating example, because it’s a Windows-1254 / utf-8 mixup, and I hadn’t seen one of those in the wild before. There are two problems now:

  • ftfy has never tried to recognize Windows-1254 mojibake
  • the third character was an unassigned byte in Windows-1254 that has been replaced by ‘�’, so if we did look for Windows-1254 mojibake patterns, the intended text could still have been any of these characters:
U+1F05A  🁚       [So] DOMINO TILE HORIZONTAL-05-06
U+1F35A  🍚      [So] COOKED RICE
U+1F39A  🎚       [So] LEVEL SLIDER
U+1F3DA  🏚       [So] DERELICT HOUSE BUILDING
U+1F41A  🐚      [So] SPIRAL SHELL
U+1F75A  🝚       [So] ALCHEMICAL SYMBOL FOR POWDERED BRICK
U+1F79A  🞚       [So] WHITE DIAMOND CONTAINING BLACK VERY SMALL DIAMOND
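The ambiguity can be reproduced with a short sketch that substitutes every byte Windows-1254 leaves unassigned for the '�' and keeps the guesses that decode as valid UTF-8. It relies on Python's built-in `cp1254` codec rejecting unassigned bytes and assumes the text contains exactly one U+FFFD. The function names are illustrative and not part of ftfy.

```python
import unicodedata

# Mojibake prefix from the issue; U+FFFD marks the byte that was lost.
MOJIBAKE = "\u011f\u0178\ufffd\u0161"


def unassigned_cp1254_bytes():
    """Return the byte values that Python's cp1254 codec refuses to decode."""
    missing = []
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode("cp1254")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def candidate_repairs(text):
    """Try each unassigned byte in place of the single U+FFFD in `text`."""
    head, _, tail = text.partition("\ufffd")
    results = []
    for value in unassigned_cp1254_bytes():
        raw = head.encode("cp1254") + bytes([value]) + tail.encode("cp1254")
        try:
            results.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # this guess does not form valid UTF-8
    return results


for char in candidate_repairs(MOJIBAKE):
    print(f"U+{ord(char):04X}  {char}  {unicodedata.name(char, '<unnamed>')}")
```

Nothing in the mojibake itself says which candidate was intended. In this case only the surrounding words "Cooked Rice" single out U+1F35A.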

rspeer commented, Feb 22, 2021

Thanks for the details. That’s terrifying – I had no idea that the default behavior of requests on a text file is to blindly accept whatever chardet says.

Even though requests provides a better path to decoding JSON, I’d still say that the issue is really on requests’ end, for silently introducing encoding bugs like that.
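As a rough illustration of how such a bug can arise, the sketch below decodes the same UTF-8 JSON bytes twice: once with a wrongly guessed Windows-1254 encoding, and once explicitly as UTF-8. It assumes the mojibake came from a detector guessing Windows-1254 with undecodable bytes replaced. The byte string stands in for a response body (what requests exposes as `response.content`), so no network call is made.

```python
import json

# Stand-in for the raw bytes of an HTTP response body, encoded as UTF-8.
raw_body = json.dumps({"title": "🍚 Cooked Rice"}, ensure_ascii=False).encode("utf-8")

# Decoding with a wrongly guessed encoding. The byte 0x8D is unassigned in
# Windows-1254, so it is replaced by U+FFFD and cannot be recovered later.
wrong_title = json.loads(raw_body.decode("cp1254", errors="replace"))["title"]

# Decoding the bytes explicitly as UTF-8 before parsing avoids the guess.
right_title = json.loads(raw_body.decode("utf-8"))["title"]

print(ascii(wrong_title))
print(right_title)
```

With requests, the equivalent would be to parse `response.content` yourself or to call `response.json()`, rather than reading `response.text` when the server does not declare a charset.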
