Setting the encoding to UTF-8 for some websites with non-UTF-8 encoding changes response.text to gibberish

See original GitHub issue

When making a request to http://www.qq.com (or other pages like it), the encoding is returned as GB-2312. Looking at response.text in the debugger, everything looks fine and the characters are correct.

However, setting response.encoding to UTF-8 causes the text to become garbled gibberish.

I suspect the document might actually be UTF-8 encoded, but the Content-Type header and the document itself both say GB-2312, so the decoder gets confused.

Expected Result

Setting response.encoding to ‘utf-8’ should not cause text to become gibberish.

Actual Result

Text becomes gibberish.

Reproduction Steps

import requests

response = requests.get(
    'http://www.qq.com',
    headers={
        'Accept-Language': 'en-US, en;q=0.5',
        'Accept-Charset': 'utf-8',
    },
    timeout=10,
    verify=False,
)

# If we have an error, raise an exception
response.raise_for_status()

# Looking at response.text, it looks fine right now
print(
    response.text[0:200]
)

# It is GB2312, convert to UTF-8
if response.encoding != 'utf-8':
    response.encoding = 'utf-8'

# Looking at response.text, it is garbled now
print(
    response.text[0:200]
)

Output looks like: [image] and [image]
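
The behaviour in the reproduction can be imitated offline. The sketch below assumes the page really is GB2312 (which fits the text looking correct before the encoding is changed) and uses a short hard-coded string as a stand-in for response.content. It relies only on the fact that response.text decodes the raw bytes with response.encoding and replaces undecodable bytes; assigning response.encoding does not transcode anything.

```python
# Stand-in for response.content: the bytes a GB2312 server would send.
# The sample string is an assumption chosen for illustration.
raw = "腾讯首页".encode("gb2312")

# Equivalent of response.text while response.encoding == "gb2312":
# the bytes are decoded with the codec they were written in.
as_gb2312 = raw.decode("gb2312")

# Equivalent of response.text after response.encoding = "utf-8":
# the same bytes are decoded with the wrong codec. Invalid sequences
# are replaced with U+FFFD, so the result is garbled.
as_utf8 = raw.decode("utf-8", errors="replace")

# To obtain UTF-8 bytes, decode with the real encoding first,
# then re-encode the resulting str.
utf8_bytes = as_gb2312.encode("utf-8")

print(as_gb2312)
print(as_utf8)
print(utf8_bytes.decode("utf-8"))
```

In other words, if the goal is UTF-8 output, leaving response.encoding alone and calling response.text.encode('utf-8') is probably what is wanted.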

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.8.2"
  },
  "platform": {
    "release": "5.4.0-39-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010106f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": false
}

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

sigmavirus24 commented, Jul 2, 2020

@cdvv7788 please read the issue closely. They’re setting utf-8 if the encoding isn’t already utf8

# It is GB2312, convert to UTF-8
if response.encoding != 'utf-8':
    response.encoding = 'utf-8'

cdvv7788 commented, Jul 2, 2020

According to https://requests.readthedocs.io/en/master/user/advanced/#encodings, the encoding is extracted from the headers, and if none is found, chardet is used as a fallback. When the encoding is declared in the body, it seems to be ignored. I am trying out this one: https://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.encoding.html_body_declared_encoding You can figure out the encoding with that function and then set response.encoding = 'my-body-encoding'. After that, response.text should be correct.

I think that requests should include something similar. There are many cases where the encoding is in the body.
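
A rough idea of what "reading the encoding from the body" could look like, using only the standard library. The helper name sniff_meta_charset, the regular expression, and the sample HTML are all assumptions made for illustration; this is not how requests or w3lib implement it, and a real page may need the more thorough detection that w3lib provides.

```python
import re

# Hypothetical pattern: finds a charset declared inside a <meta> tag,
# covering both <meta charset="..."> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(body: bytes, scan_limit: int = 4096):
    """Return the lowercased charset declared in the first bytes of body, or None."""
    match = _META_CHARSET.search(body[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Stand-in for response.content from a page that declares GB2312 in its markup.
sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=gb2312"></head><body>'
    + "腾讯".encode("gb2312")
    + b"</body></html>"
)

declared = sniff_meta_charset(sample)
# Fall back to UTF-8 when nothing is declared; replace undecodable bytes.
text = sample.decode(declared or "utf-8", errors="replace")
print(declared)
print(text)
```

With requests, the same idea would be to pass response.content to the detector and assign the result to response.encoding before reading response.text.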
