need help for fixing a text


I have the following text: “WalldÃ?©n”. I know the correct text is “Walldén” (“Walld\u00e9n”), but ftfy.fix_text returns the input unchanged:

import ftfy

s = "Walld�©n"
print(ftfy.fix_text(s))     # prints "Walld�©n"

Do you know how to correct this text? Thanks.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

rspeer commented, Oct 10, 2017

@ArtemBernatskyy That one is a mixup between Windows-1251 and Windows-1252. You’d want 'Ñòðåëà'.encode('windows-1252').decode('windows-1251').

Both of these are single-byte encodings, which makes this the hardest case to try to auto-detect in ftfy – see issue #18.
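As a small illustration of the round trip described above, the following sketch re-encodes the garbled string with the codec it was wrongly read as, then decodes the recovered bytes with the codec it was actually written in. The variable names are only for illustration.

```python
# Text that was written as Windows-1251 (Cyrillic) but read as Windows-1252.
garbled = 'Ñòðåëà'

# Re-encode with the wrong codec to recover the original bytes,
# then decode those bytes with the codec that was actually intended.
fixed = garbled.encode('windows-1252').decode('windows-1251')

print(fixed)  # prints "Стрела"
```

This only works because you already know which two encodings were confused. Since every byte is valid in both single-byte encodings, nothing in the data itself signals the mixup.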

rspeer commented, Oct 11, 2017

@jabbalaci Oh, so it’s the Unicode replacement character �, not a literal question mark. (Both of them can show up in different encoding accidents.)

Unfortunately this is still really lossy, but maybe I can help you with your approach. I can mostly reconstruct the encoding mixup that produced this data – it probably uses Latin-1 instead of the very similar Windows-1252, and then throws out the control characters. This would at least help you to generate the strings you need to replace:

import re

def mangle_text(text):
    # Misread UTF-8 as Latin-1 twice, then replace the C1 control characters with U+FFFD.
    text = text.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
    return re.sub('[\x80-\x9f]', '\ufffd', text)

For example:

>>> mangle_text('ñ')
'Ã�Â±'

But you’ll probably see why I don’t think even an improvement to ftfy would be able to fix this, which is that different common characters end up with the same representation:

>>> mangle_text('Å')
'Ã�Â�'

>>> mangle_text('Á')
'Ã�Â�'
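One way to act on this is to build the replacement table by running mangle_text over a set of candidate characters and keeping only the mangled forms that map back to exactly one original. The sketch below makes two assumptions: the data went through exactly the mangling reconstructed above, and the originals of interest lie in U+00C0 to U+00FF. The names `candidates`, `unambiguous`, `ambiguous`, and `unmangle` are hypothetical helpers, not part of ftfy.

```python
import re
from collections import defaultdict

def mangle_text(text):
    # Same reconstruction as above: double UTF-8/Latin-1 mixup, then
    # C1 control characters replaced by U+FFFD.
    text = text.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')
    return re.sub('[\x80-\x9f]', '\ufffd', text)

# Assumed candidate originals: the characters U+00C0..U+00FF.
candidates = [chr(cp) for cp in range(0xC0, 0x100)]

# Group the candidates by their mangled form.
table = defaultdict(list)
for ch in candidates:
    table[mangle_text(ch)].append(ch)

# Only mangled forms with a single possible original can be reversed safely.
unambiguous = {m: chars[0] for m, chars in table.items() if len(chars) == 1}
ambiguous = {m: chars for m, chars in table.items() if len(chars) > 1}

def unmangle(text):
    # Replace each reversible mangled sequence with its original character.
    for mangled, original in unambiguous.items():
        text = text.replace(mangled, original)
    return text

print(unmangle(mangle_text('Walldén')))  # prints "Walldén"
print(len(ambiguous))                    # prints 1
```

Under these assumptions, the lowercase half of the range (à through ÿ) is recoverable. All 32 characters from À through ß collapse into the same mangled sequence, so those can only be restored by guessing from context.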