Setting encoding to UTF-8 for some websites with non-UTF-8 encoding changes response.text to gibberish

When making a request to http://www.qq.com (or other pages like it), the encoding is returned as GB-2312. Looking at the response.text with the debugger, everything looks fine - correct characters.

However, setting response.encoding to UTF-8 garbles the text.

I suspect the document might actually be UTF-8 encoded, but the Content-Type header and the document itself both declare GB-2312, so the decoder gets confused.
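
One way to check this hypothesis without a network call is to decode the same bytes with both codecs. The sketch below uses a made-up sample string (an assumption, not content taken from qq.com). If the bytes were really UTF-8, the UTF-8 decode would return the original text. For GB2312 bytes it does not.

```python
# Hypothetical sample text standing in for part of the page body.
sample = "腾讯首页"

# Bytes as a GB2312-encoded server would send them.
raw = sample.encode("gb2312")

# Decoding with the declared encoding recovers the text.
as_gb = raw.decode("gb2312")

# Decoding the same bytes as UTF-8 does not; invalid sequences
# are replaced instead of raising, as a lenient decoder would do.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_gb == sample)    # the GB2312 decode round-trips
print(as_utf8 == sample)  # the UTF-8 decode does not
```

If the readable text only appears with the GB2312 decode, the page is most likely GB2312 as declared, and forcing UTF-8 is what produces the gibberish.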

Expected Result

Setting response.encoding to ‘utf-8’ should not cause text to become gibberish.

Actual Result

Text becomes gibberish.

Reproduction Steps

import requests

response = \
    requests.get(
        'http://www.qq.com',
        headers={
            'Accept-Language': 'en-US, en;q=0.5',
            'Accept-Charset':  'utf-8',
        },
        timeout=10,
        verify=False,
    )

# If we have an error, raise an exception
response.raise_for_status()

# Looking at response.text, it looks fine right now
print(
    response.text[0:200]
)

# It is GB2312, convert to UTF-8
if response.encoding != 'utf-8':
    response.encoding = 'utf-8'

# Looking at response.text, it is garbled now
print(
    response.text[0:200]
)

Output looks like:

[screenshot omitted: readable text from the first print, garbled text from the second print]
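
A note on the "convert to UTF-8" step in the reproduction: in requests, response.text is response.content decoded with response.encoding, so assigning response.encoding only changes which codec is used for decoding. It does not re-encode anything. The following is a minimal sketch of what an actual conversion could look like, with a hypothetical byte string standing in for response.content:

```python
# Hypothetical stand-in for response.content from a GB2312 page.
content = "腾讯首页".encode("gb2312")

# Decode with the encoding the page really uses ...
text = content.decode("gb2312")

# ... then re-encode only if UTF-8 bytes are needed (e.g. to write a file).
utf8_bytes = text.encode("utf-8")

print(text)
print(utf8_bytes != content)  # the two byte sequences differ
```

Under this reading, leaving response.encoding at the detected value and encoding response.text afterwards would avoid the garbled output.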

System Information

$ python -m requests.help

{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.8.2"
  },
  "platform": {
    "release": "5.4.0-39-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010106f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": false
}

Issue Analytics

State:
Created 3 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

sigmavirus24 commented, Jul 2, 2020

@cdvv7788 please read the issue closely. They’re setting utf-8 if the encoding isn’t already utf8

# It is GB2312, convert to UTF-8
if response.encoding != 'utf-8':
    response.encoding = 'utf-8'

cdvv7788 commented, Jul 2, 2020

According to https://requests.readthedocs.io/en/master/user/advanced/#encodings, the encoding is extracted from the headers, and if it cannot be found there, chardet is used as a fallback. When the encoding is declared in the body, it seems to be ignored. I am trying out this one: https://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.encoding.html_body_declared_encoding. You can figure out the encoding with it and then set response.encoding = 'my-body-encoding'. After that, response.text should be correct.

I think that requests should include something similar. There are many cases where the encoding is in the body.
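
To illustrate the idea of reading the encoding from the body, here is a simplified, standard-library-only sketch. It is not w3lib's implementation and not part of requests. The function name sniff_declared_encoding, the scan_limit parameter, and the sample page are all hypothetical, and the regular expression covers only the two common meta tag forms.

```python
import re
from typing import Optional

# Simplified pattern (an assumption, not a full HTML parser): matches
# <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_declared_encoding(body: bytes, scan_limit: int = 4096) -> Optional[str]:
    """Return the charset named in a meta tag near the top of body, if any."""
    match = _META_CHARSET.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical page bytes standing in for response.content.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=gb2312"></head><body>'
    + "腾讯首页".encode("gb2312")
    + b"</body></html>"
)

encoding = sniff_declared_encoding(page)
text = page.decode(encoding or "utf-8", errors="replace")
print(encoding)
```

With requests, the equivalent step would be to run the sniffing function on response.content and assign the result to response.encoding before reading response.text.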