Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests

This issue is about sending an HTTP request, and not about the response.

    import requests
    headers = {'Content-Type': 'application/json; charset=utf-8'}
    # data is passed as a str, not bytes; this is the case the issue is about
    response = requests.post(url, headers=headers, data=my_native_python3_utf8_string)

Expected Result

I expected the request to be sent as UTF-8. After all, everything in the Python 3 ecosystem uses UTF-8 by default.

Actual Result

My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:

https://github.com/python/cpython/blob/c3dd7e45cc5d36bbe2295c2840faabb5c75d83e4/Lib/http/client.py#L1312

  • In my opinion, RFC 2616 Section 3.7.1 does not apply here.
  • The whole Python 3 ecosystem uses UTF-8 by default, so I find it utterly confusing that a library silently re-encodes a UTF-8 string as ISO-8859-1.
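To illustrate the mismatch described above, here is a minimal standard-library sketch. It does not call requests or http.client; it only assumes, as the report states, that a str body ends up encoded as ISO-8859-1 while the header promises UTF-8. The sample payload is made up for illustration.

```python
# Hypothetical payload containing one non-ASCII character.
text = '{"city": "Málaga"}'

# What goes on the wire if the str body is encoded as ISO-8859-1.
as_latin1 = text.encode("iso-8859-1")
# What the 'charset=utf-8' header promises the recipient.
as_utf8 = text.encode("utf-8")

print(as_latin1)  # b'{"city": "M\xe1laga"}'
print(as_utf8)    # b'{"city": "M\xc3\xa1laga"}'

# A recipient that trusts the header and decodes as UTF-8 cannot read
# the ISO-8859-1 bytes.
try:
    as_latin1.decode("utf-8")
    receiver_failed = False
except UnicodeDecodeError:
    receiver_failed = True

print(receiver_failed)  # True
```

The two byte strings differ only at the non-ASCII character, which is why the problem surfaces at the recipient rather than at the sender.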

Suggestion

I suggest that if requests takes a native UTF-8 string as the data argument, it should make sure the string is handled as UTF-8 all the way and is not silently re-encoded as ISO-8859-1. Otherwise it can cause really obscure bugs.

Related

This issue is about sending a request. Related open issues are mostly concerned with the response:

Reproduction Steps

See above.

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.8"
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.4"
  },
  "platform": {
    "release": "4.15.0-107-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010104f",
    "version": "19.1.0"
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": true
}

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
baharev commented, Oct 29, 2020

@sethmlarson

> It is a best practice to provide bytes instead of str for the body so that no underlying library needs to guess at what encoding you want.

Yes, I have figured that out the hard way.
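For reference, a minimal sketch of that best practice, using only the standard library. The helper name `to_utf8_json_body` and the payload are hypothetical, not part of requests.

```python
import json


def to_utf8_json_body(payload) -> bytes:
    """Serialize payload to JSON and encode it explicitly as UTF-8 bytes."""
    # ensure_ascii=False keeps non-ASCII characters as-is in the JSON text,
    # so the explicit .encode("utf-8") is what decides the bytes on the wire.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


body = to_utf8_json_body({"city": "Málaga"})
print(body)  # b'{"city": "M\xc3\xa1laga"}'
```

The resulting bytes object can then be passed as `data=body` to `requests.post`, together with the Content-Type header from the original snippet, so that no library has to pick an encoding.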

The problem is that it is incredibly easy to forget an .encode() call and accidentally pass the string unencoded. To rub salt in the wound, it will even “work” with ASCII characters, so you won’t notice the bug (the missing .encode()) until a more exotic character shows up in the string. Ouch.

In other words, I find the current interface a loaded gun, and you can easily shoot yourself in the foot.
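A small sketch of why the bug stays hidden with ASCII input. The helper `encodings_agree` is hypothetical and only compares the two encodings of a string; it does not involve requests.

```python
def encodings_agree(text: str) -> bool:
    """Return True if ISO-8859-1 and UTF-8 produce identical bytes for text."""
    try:
        return text.encode("iso-8859-1") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character that ISO-8859-1 cannot represent.
        return False


print(encodings_agree('{"name": "Smith"}'))   # True: pure ASCII, bug invisible
print(encodings_agree('{"name": "Müller"}'))  # False: bytes differ
print(encodings_agree('{"name": "Łukasz"}'))  # False: not encodable in ISO-8859-1
```

Tests that only use ASCII data pass in both cases, so a missing .encode() call can go unnoticed until real-world input arrives.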

> Requests is under feature freeze with no room for breaking changes so this change is extremely unlikely to land, regardless of its utility.

OK, I understand.

0 reactions
baharev commented, Oct 29, 2020

@sethmlarson +1s are great, I really appreciate it, but what should I do so that this issue is eventually addressed?

If this issue remains closed, it will be forgotten.

A “fix” could be as simple as raising TypeError if the data argument requires encoding (a string, for example). I don’t think there is a fix that is not a breaking change.
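Until something like that exists in the library, the same check can be approximated on the caller's side. This is only a sketch of the idea: `require_bytes` is a hypothetical helper, not part of requests, and it only guards against the str case discussed in this issue.

```python
def require_bytes(data):
    """Reject str bodies so a missing .encode() fails loudly and early."""
    if isinstance(data, str):
        raise TypeError(
            "request body is a str; encode it explicitly, "
            "for example data.encode('utf-8')"
        )
    return data


# Bytes pass through unchanged.
print(require_bytes("Málaga".encode("utf-8")))  # b'M\xc3\xa1laga'

# A str is rejected instead of being encoded implicitly.
try:
    require_bytes("Málaga")
    rejected = False
except TypeError:
    rejected = True

print(rejected)  # True
```

Wrapping the body as `data=require_bytes(body)` at each call site would turn the silent re-encoding into an immediate error, at the cost of one extra call.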
