Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests
This issue is about sending an HTTP request, not about the response.
import requests

headers = {'Content-Type': 'application/json; charset=utf-8'}
response = requests.post(url, headers=headers, data=my_native_python3_utf8_string)
Expected Result
I expected the request body to be sent as UTF-8; after all, everything in the Python 3 ecosystem uses UTF-8 by default.
Actual Result
My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:
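The relevant logic, paraphrased and simplified from CPython's http/client.py (a rough sketch; the exact function names and error handling vary between Python versions), is essentially:

if isinstance(body, str):
    # The CPython source cites RFC 2616 Section 3.7.1: text defaults to a charset of iso-8859-1
    body = body.encode("latin-1")  # "latin-1" is Python's name for ISO-8859-1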
- In my opinion, RFC 2616 Section 3.7.1 does not apply here, since the request explicitly declares charset=utf-8 in its Content-Type header.
- The whole Python 3 ecosystem uses UTF-8 by default, so I find it utterly confusing that a library silently re-encodes a UTF-8 string as ISO-8859-1.
Suggestion
I suggest that if requests takes a native Python 3 string as the data argument, it ensures the string is handled as UTF-8 all the way through and is never silently re-encoded as ISO-8859-1. The current behaviour can cause really obscure bugs.
Related
This issue is about sending a request. Related open issues are mostly concerned with the response.
Reproduction Steps
See above.
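For completeness, here is a self-contained reproduction sketch (not part of the original report; the payload and server setup are arbitrary choices for illustration). With the versions listed below (requests 2.22.0 on urllib3 1.x), the first print shows ISO-8859-1 bytes on the wire despite the charset=utf-8 header; newer urllib3 releases may behave differently:

import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

received = []

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Capture the raw body bytes exactly as they arrive on the socket.
        length = int(self.headers["Content-Length"])
        received.append(self.rfile.read(length))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the test output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
headers = {"Content-Type": "application/json; charset=utf-8"}
payload = json.dumps({"name": "Ångström"}, ensure_ascii=False)  # a str, not bytes

requests.post(url, headers=headers, data=payload)                  # str body
requests.post(url, headers=headers, data=payload.encode("utf-8"))  # bytes body

print(received[0])  # b'{"name": "\xc5ngstr\xf6m"}'         <- ISO-8859-1, despite charset=utf-8
print(received[1])  # b'{"name": "\xc3\x85ngstr\xc3\xb6m"}'  <- the intended UTF-8
server.shutdown()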
System Information
$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.8"
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.4"
  },
  "platform": {
    "release": "4.15.0-107-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010104f",
    "version": "19.1.0"
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": true
}
@sethmlarson
Yes, I have figured that out the hard way.
The problem is that it is incredibly easy to forget an .encode() call (i.e. accidentally pass the string unencoded). To rub salt in the wound, it will even “work” with ASCII characters, so you won’t notice you have a bug (the forgotten .encode()) until a more exotic character shows up in the string. Ouch. In other words, I find the current interface a loaded gun: it is very easy to shoot yourself in the foot.
OK, I understand.
@sethmlarson +1s are great, and I really appreciate them, but what should I do to get this issue addressed eventually?
If the issue remains closed, it will be forgotten.
A “fix” could be as simple as throwing a TypeError if the data argument requires encoding (for example, when it is a string). I don’t think there is a fix that is not a breaking change.