Cannot resolve mojibake on emoji
We’re getting JSON from a third party:
{
"title": "\\u011f\\u0178\\ufffd\\u0161 Cooked Rice"
}
This is supposed to correspond to the Cooked Rice emoji: 🍚 https://www.compart.com/en/unicode/U+1F35A
They also send us:
{
"first_name": "Ya\\u00c3\\u00abl"
}
This one can be cleaned up by simply round-tripping through latin1:
print('Ya\u00c3\u00abl')
'YaÃ«l'
print('Ya\u00c3\u00abl'.encode('latin1').decode('utf-8'))
'Yaël'
Any help here would be amazing.
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (3 by maintainers)
Unfortunately, ftfy isn’t going to be able to automatically fix either of these.
The second example is too short to be reliably detected. If you know that latin1 / utf-8 mixups are likely, you should probably just try that encoding and decoding yourself.
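A minimal sketch of doing that round trip by hand, assuming the only damage is UTF-8 bytes that were decoded as latin1. The helper name `undo_latin1_utf8_mixup` is made up for illustration and is not part of ftfy:

```python
def undo_latin1_utf8_mixup(text: str) -> str:
    """Reverse UTF-8 bytes that were mistakenly decoded as Latin-1.

    Returns the input unchanged if it does not round-trip cleanly.
    """
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a Latin-1 / UTF-8 mixup (or not only that): leave it alone.
        return text


print(undo_latin1_utf8_mixup("Ya\u00c3\u00abl"))  # Yaël
print(undo_latin1_utf8_mixup("Ya\u00ebl"))        # Yaël (already correct, unchanged)
```

The fallback matters because correctly decoded text usually fails one of the two steps and is returned untouched. It is still a heuristic: a string that legitimately contains a sequence like "Ã«" would be rewritten, so it is best applied only to fields from a source known to have this problem.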
The first one makes a fascinating example, because it’s a Windows-1254 / utf-8 mixup, and I hadn’t seen one of those in the wild before. There are two problems now:
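To illustrate the Windows-1254 / utf-8 mixup itself, here is a small sketch that reproduces the string from the report using only Python's built-in codecs. It assumes the third party decoded the bytes leniently (the equivalent of `errors="replace"`), which is a guess based on the `\ufffd` in their output:

```python
rice = "\U0001F35A"                 # the Cooked Rice emoji
utf8_bytes = rice.encode("utf-8")   # b'\xf0\x9f\x8d\x9a'

# Byte 0x8D is unassigned in Windows-1254, so a lenient decoder
# substitutes U+FFFD and the original byte value is lost.
garbled = utf8_bytes.decode("cp1254", errors="replace")
print(ascii(garbled))  # '\u011f\u0178\ufffd\u0161'

# Reversing the mixup fails, because U+FFFD has no Windows-1254 encoding.
try:
    garbled.encode("cp1254").decode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot reverse:", exc.reason)
```

Unlike the latin1 case, the plain encode/decode round trip cannot work here: one of the four original bytes has already been replaced, so any repair would have to guess what it was.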
Thanks for the details. That’s terrifying. I had no idea that the default behavior of requests on a text file is to blindly accept whatever chardet says. Even though requests provides a better path to decoding JSON, I’d still say that the issue is really on requests’ end, for silently introducing encoding bugs like that.
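A rough sketch of why parsing the raw bytes is safer than parsing text decoded with a guessed encoding. It uses only the standard library; the payload is a stand-in for the third party's response, and cp1254 stands in for a wrong guess. With requests, the analogous move would presumably be to parse `response.content` or call `response.json()` rather than `response.text`, but that mapping is an assumption here, not something the thread spells out:

```python
import json

# UTF-8 encoded JSON body, as it might arrive from the third party.
raw = '{"title": "\U0001F35A Cooked Rice"}'.encode("utf-8")

# Decoding with a wrongly guessed encoding bakes the mojibake in.
wrong = json.loads(raw.decode("cp1254", errors="replace"))
print(ascii(wrong["title"]))   # '\u011f\u0178\ufffd\u0161 Cooked Rice'

# json.loads accepts bytes and detects UTF-8/16/32 itself.
right = json.loads(raw)
print(ascii(right["title"]))   # '\U0001f35a Cooked Rice'
```

Handing the bytes to the JSON parser keeps the encoding decision with the format, which only allows UTF-8, UTF-16, or UTF-32, instead of with a statistical detector.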