need help for fixing a text

I have the following text: “WalldÃ?Â©n”. I know that correctly it is “Walldén” (“Walld\u00e9n”). ftfy.fix_text returns the same text:

import ftfy

s = "WalldÃ?Â©n"
print(ftfy.fix_text(s))     # prints "WalldÃ?Â©n"

Do you know how to correct this text? Thanks.

Issue Analytics

State:
Created 6 years ago
Comments:6 (3 by maintainers)

Top GitHub Comments

rspeer commented, Oct 10, 2017

@ArtemBernatskyy That one is a mixup between Windows-1251 and Windows-1252. You’d want 'Ñòðåëà'.encode('windows-1252').decode('windows-1251').

Both of these are single-byte encodings, which makes this the hardest case to try to auto-detect in ftfy – see issue #18.
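As a minimal sketch of that one-liner, assuming the text really was Windows-1251 bytes misread as Windows-1252 (the helper name `fix_cp1252_as_cp1251` is made up for illustration and is not part of ftfy):

```python
def fix_cp1252_as_cp1251(text):
    """Undo a Windows-1251 text that was mistakenly decoded as Windows-1252."""
    # Re-encode with the wrong codec to recover the original bytes,
    # then decode those bytes with the codec that was actually intended.
    return text.encode('windows-1252').decode('windows-1251')


print(fix_cp1252_as_cp1251('Ñòðåëà'))  # prints: Стрела
```

Because every byte is valid in both encodings, nothing in the text itself signals which interpretation is right, so this only helps when you already know which two encodings were mixed up.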

rspeer commented, Oct 11, 2017

@jabbalaci Oh, so it’s the Unicode replacement character �, not a literal question mark. (Both of them can show up in different encoding accidents.)

Unfortunately this is still really lossy, but maybe I can help you with your approach. I can mostly reconstruct the encoding mixup that produced this data – it probably uses Latin-1 instead of the very similar Windows-1252, and then throws out the control characters. This would at least help you to generate the strings you need to replace:

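import re

def mangle_text(text):
    # Encode as UTF-8 and misread as Latin-1, twice over
    text = text.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
    # The C1 control characters are lost, shown as U+FFFD
    return re.sub('[\x80-\x9f]', '\ufffd', text)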

For example:

>>> mangle_text('ñ')
'Ã�Â±'

But you’ll probably see why I don’t think even an improvement to ftfy would be able to fix this, which is that different common characters end up with the same representation:

>>> mangle_text('Å')
'Ã�Â�'

>>> mangle_text('Á')
'Ã�Â�'
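One way to use this for a replacement table is sketched below, under some assumptions: `CANDIDATES`, `build_reverse_table`, and `repair` are hypothetical names, the candidate set is only a guess at which accented letters occur in the data, and the input contains the replacement character U+FFFD rather than a literal question mark. The sketch groups candidates by their mangled form and only reverses the forms that map back to exactly one candidate.

```python
import re
from collections import defaultdict


def mangle_text(text):
    # Same mix-up as above: UTF-8 misread as Latin-1 twice, control chars lost.
    text = text.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
    return re.sub('[\x80-\x9f]', '\ufffd', text)


# Hypothetical candidate set: adjust to the letters you expect in your data.
CANDIDATES = 'ÀÁÂÃÄÅÆÇÈÉàáâãäåæçèéíñóöúü'


def build_reverse_table(candidates):
    """Group candidate characters by the mangled string they produce."""
    groups = defaultdict(list)
    for ch in candidates:
        groups[mangle_text(ch)].append(ch)
    return dict(groups)


def repair(text, groups):
    """Replace only the mangled forms that have a single possible original."""
    for mangled, originals in groups.items():
        if len(originals) == 1:
            text = text.replace(mangled, originals[0])
    return text


groups = build_reverse_table(CANDIDATES)
print(repair('WalldÃ\ufffdÂ©n', groups))  # prints: Walldén
```

The result depends entirely on the candidate set. With the letters above, 'é' is the only candidate that produces 'Ã�Â©', so it can be restored, while all the uppercase letters share 'Ã�Â�' and are left untouched. Adding '©' to the candidates would make 'é' ambiguous as well, since both mangle to the same string.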
