Do not re-encode a UTF-8 string as ISO-8859-1 silently in HTTP requests
This issue is about sending an HTTP request, not about the response.
import requests

headers = {'Content-Type': 'application/json; charset=utf-8'}
response = requests.post(url, headers=headers, data=my_native_python3_utf8_string)
Expected Result
I expected the request body to be sent as UTF-8; after all, everything in the Python 3 ecosystem uses UTF-8 by default.
Actual Result
My UTF-8 string is silently re-encoded as ISO-8859-1, causing confusing bugs at the recipient. The code in http.client is doing it:
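The relevant logic, paraphrased and simplified from CPython's http/client.py (a rough sketch; the exact function names and error handling vary between Python versions), is essentially:

if isinstance(body, str):
    # The CPython source cites RFC 2616 Section 3.7.1: text defaults to a charset of iso-8859-1
    body = body.encode("latin-1")  # "latin-1" is Python's name for ISO-8859-1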
- In my opinion, RFC 2616 Section 3.7.1 does not apply here, since the request explicitly declares charset=utf-8 in its Content-Type header.
- The whole Python 3 ecosystem uses UTF-8 by default, so I find it utterly confusing that a library silently re-encodes a UTF-8 string as ISO-8859-1.
Suggestion
I suggest that if requests takes a native Python 3 string as the data argument, it ensures the string is handled as UTF-8 all the way through and is never silently re-encoded as ISO-8859-1. The current behaviour can cause really obscure bugs.
Related
This issue is about sending a request. Related open issues are mostly concerned with the response.
Reproduction Steps
See above.
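For completeness, here is a self-contained reproduction sketch (not part of the original report; the payload and server setup are arbitrary choices for illustration). With the versions listed below (requests 2.22.0 on urllib3 1.x), the first print shows ISO-8859-1 bytes on the wire despite the charset=utf-8 header; newer urllib3 releases may behave differently:

import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

received = []

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Capture the raw body bytes exactly as they arrive on the socket.
        length = int(self.headers["Content-Length"])
        received.append(self.rfile.read(length))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the test output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
headers = {"Content-Type": "application/json; charset=utf-8"}
payload = json.dumps({"name": "Ångström"}, ensure_ascii=False)  # a str, not bytes

requests.post(url, headers=headers, data=payload)                  # str body
requests.post(url, headers=headers, data=payload.encode("utf-8"))  # bytes body

print(received[0])  # b'{"name": "\xc5ngstr\xf6m"}'         <- ISO-8859-1, despite charset=utf-8
print(received[1])  # b'{"name": "\xc3\x85ngstr\xc3\xb6m"}'  <- the intended UTF-8
server.shutdown()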
System Information
$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.8"
  },
  "idna": {
    "version": "2.8"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.4"
  },
  "platform": {
    "release": "4.15.0-107-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010104f",
    "version": "19.1.0"
  },
  "requests": {
    "version": "2.22.0"
  },
  "system_ssl": {
    "version": "1010107f"
  },
  "urllib3": {
    "version": "1.25.8"
  },
  "using_pyopenssl": true
}
@sethmlarson
Yes, I have figured that out the hard way.
The problem is that it is incredibly easy to forget an .encode() call (i.e. accidentally pass the string unencoded). To rub salt in the wound, it will even “work” with ASCII characters, so you won’t notice you have a bug (the forgotten .encode()) until a more exotic character shows up in the string. Ouch. In other words, I find the current interface a loaded gun: it is very easy to shoot yourself in the foot.
OK, I understand.
@sethmlarson +1s are great, and I really appreciate them, but what should I do to get this issue addressed eventually?
If the issue remains closed, it will be forgotten.
A “fix” could be as simple as throwing a TypeError if the data argument requires encoding (for example, when it is a string). I don’t think there is a fix that is not a breaking change.