need help for fixing a text
I have the following text: “WalldÃ?©n”. I know the correct form is “Walldén” (“Walld\u00e9n”), but ftfy.fix_text
returns the same text:
import ftfy

s = "Walld�©n"
print(ftfy.fix_text(s)) # prints "Walld�©n"
Do you know how to correct this text? Thanks.
Issue Analytics
- State:
- Created 6 years ago
- Comments:6 (3 by maintainers)
@ArtemBernatskyy That one is a mixup between Windows-1251 and Windows-1252. You’d want 'Ñòðåëà'.encode('windows-1252').decode('windows-1251').
Both of these are single-byte encodings, which makes this the hardest case to try to auto-detect in ftfy – see issue #18.
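As a quick check of that round trip, here is a minimal sketch using only Python's built-in codecs. The helper name swap_encoding is hypothetical and is not part of ftfy.

```python
def swap_encoding(text: str, wrong: str, right: str) -> str:
    """Undo a mixup where bytes meant for `right` were decoded as `wrong`."""
    # Re-encode with the codec that was wrongly applied to recover the raw
    # bytes, then decode those bytes with the codec that should have been used.
    return text.encode(wrong).decode(right)


fixed = swap_encoding('\u00d1\u00f2\u00f0\u00e5\u00eb\u00e0',  # 'Ñòðåëà'
                      'windows-1252', 'windows-1251')
print(fixed)  # Стрела
```

Since nearly every byte value maps to a printable character in both code pages, the wrong decoding rarely raises an error, which is part of why this kind of mixup is hard to detect automatically.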
@jabbalaci Oh, so it’s the Unicode replacement character �, not a literal question mark. (Both of them can show up in different encoding accidents.)
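To illustrate how both symbols can arise, here is a small sketch using Python's standard errors='replace' handling. It shows two generic accidents, not necessarily the ones behind this particular data.

```python
# Decoding bytes that are invalid in the assumed encoding yields U+FFFD.
latin1_bytes = '\u00e9'.encode('latin-1')      # 'é' as the single byte 0xE9
decoded = latin1_bytes.decode('utf-8', errors='replace')
print(decoded == '\ufffd')   # True: a lone 0xE9 is not valid UTF-8

# Encoding a character the target encoding lacks yields a literal '?'.
encoded = '\u00e9'.encode('ascii', errors='replace')
print(encoded == b'?')       # True
```

In both cases the original byte or character is gone after the replacement, which is why text containing either symbol is lossy.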
Unfortunately this is still really lossy, but maybe I can help you with your approach. I can mostly reconstruct the encoding mixup that produced this data – it probably uses Latin-1 instead of the very similar Windows-1252, and then throws out the control characters. This would at least help you to generate the strings you need to replace:
For example:
But you’ll probably see why I don’t think even an improvement to ftfy would be able to fix this, which is that different common characters end up with the same representation:
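A minimal sketch of such a collision, under assumptions that are mine and not confirmed by the thread: a single round of UTF-8 bytes misread as Latin-1, with every C1 control character (U+0080 to U+009F) then replaced by U+FFFD. The helper simulate_mixup is hypothetical; it is not the code from the original comment and may not reproduce the exact string in the issue.

```python
from collections import defaultdict


def simulate_mixup(text: str) -> str:
    """Hypothetical model: UTF-8 bytes read as Latin-1, C1 controls -> U+FFFD."""
    misread = text.encode('utf-8').decode('latin-1')
    return ''.join('\ufffd' if 0x80 <= ord(ch) <= 0x9f else ch for ch in misread)


# Group the Latin-1 letters U+00C0..U+00FF by what they turn into.
collisions = defaultdict(list)
for codepoint in range(0xC0, 0x100):
    ch = chr(codepoint)
    collisions[simulate_mixup(ch)].append(ch)

# Report every garbled form that more than one source character maps to.
for garbled, sources in collisions.items():
    if len(sources) > 1:
        print(ascii(garbled), '<-', ''.join(sources))
```

Under this model every character from U+00C0 to U+00DF (À through ß) comes out as the same two-character string, so the original letter cannot be recovered from the garbled text alone, while lowercase letters such as é stay distinguishable (é becomes Ã©).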