Setting encoding to UTF-8 for some websites with non-UTF-8 encoding changes response.text to gibberish
When making a request to http://www.qq.com (or other pages like it), the encoding is returned as GB-2312. Looking at response.text in the debugger, everything looks fine: the characters are correct.
However, setting the response.encoding to UTF-8 causes the text to become garbled, gibberish:
I am thinking it might be that the document is actually utf-8 encoded, but the content-type and document all say GB-2312, so the decoder gets confused.
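The other possibility is that the bytes really are GB2312, in which case reading them as UTF-8 is what garbles them. A minimal sketch of that case, using only the standard library and a hypothetical two-character sample rather than the actual page content; it assumes that changing response.encoding only changes how the same bytes are decoded and does not convert them:

```python
# Hypothetical sample text; not taken from the page in the issue.
sample = "中文"

# The bytes a server would send if the page really is GB2312.
raw = sample.encode("gb2312")

# Decoding with the matching codec round-trips cleanly.
as_gb2312 = raw.decode("gb2312")

# Decoding the same bytes as UTF-8 fails; with errors="replace" the
# invalid sequences become U+FFFD replacement characters.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_gb2312)
print(as_utf8)
```

If that is what happens here, the first print would correspond to the readable text and the second to the garbled text.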
Expected Result
Setting response.encoding to ‘utf-8’ should not cause text to become gibberish.
Actual Result
Text becomes gibberish.
Reproduction Steps
import requests

response = requests.get(
    'http://www.qq.com',
    headers={
        'Accept-Language': 'en-US, en;q=0.5',
        'Accept-Charset': 'utf-8',
    },
    timeout=10,
    verify=False,
)

# If we have an error, raise an exception
response.raise_for_status()

# Looking at response.text, it looks fine right now
print(response.text[0:200])

# It is GB2312, convert to UTF-8
if response.encoding != 'utf-8':
    response.encoding = 'utf-8'

# Looking at response.text, it is garbled now
print(response.text[0:200])
Output looks like: [screenshot of the text before the change] and [screenshot of the text after the change]
System Information
$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.8.2"
  },
  "platform": {
    "release": "5.4.0-39-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010106f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": false
}
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (3 by maintainers)
@cdvv7788 please read the issue closely. They’re setting utf-8 if the encoding isn’t already utf8.
According to https://requests.readthedocs.io/en/master/user/advanced/#encodings, the encoding is extracted from the headers, and if it cannot be found there, chardet is used as a fallback. When the encoding is declared in the body, it seems it is just ignored. I am trying out this one: https://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.encoding.html_body_declared_encoding You can figure out the encoding with that one, and then run response.encoding = 'my-body-encoding'. After that, response.text should be correct.
I think that requests should include something similar. There are many cases where the encoding is in the body.
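For illustration, here is a rough standard-library sketch of the body-declared lookup described above. It is not what w3lib or requests actually does: the helper name body_declared_encoding, the regular expression, and the 4096-byte scan window are all assumptions made for this example, and the sample HTML is invented.

```python
import codecs
import re

# Assumed pattern: matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)",
    re.IGNORECASE,
)


def body_declared_encoding(body: bytes, scan_bytes: int = 4096):
    """Hypothetical helper: return the codec name declared in a <meta>
    tag near the start of the body, or None if nothing usable is found."""
    match = _META_CHARSET.search(body[:scan_bytes])
    if match is None:
        return None
    name = match.group(1).decode("ascii")
    try:
        # Normalise the label and reject names Python has no codec for.
        return codecs.lookup(name).name
    except LookupError:
        return None


# Invented sample page, encoded the way its <meta> tag declares.
html = '<html><head><meta charset="gb2312"><title>中文</title></head></html>'.encode("gb2312")

encoding = body_declared_encoding(html)
text = html.decode(encoding or "utf-8", errors="replace")
print(encoding, text)
```

With requests, the same idea would be to pass response.content to the helper and assign the result to response.encoding only when it is not None, before reading response.text.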