response.text converting to utf-8 poorly

With response = requests.get('some_xml_api_endpoint'), the XML response has the header <?xml version="1.0" encoding="UTF-8"?>. I’m using Ubuntu, Python 3, and requests 2.24.0, and saving the file as UTF-8 on an ext4 partition.

response.text gives a lot of broken characters. response.content.decode('utf-8') works well.

The response is rendered:

Reptileâs Tail: response.text broke the curly quote character
Reptile’s Tail: content.decode('utf-8') handled the curly quote well

Could be described as: print(bytes([226, 128, 153]).decode('utf-8'))

I expected response.text to behave the same as response.content.decode('utf-8'). I’ve never read any warnings in the guides or docs to anticipate this behavior. Some prominent warnings would help.

Unless .decode('utf-8') is doing some magic heavy-lifting.
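
To illustrate the difference reported above, here is a minimal sketch using only the three bytes from the issue. It assumes that response.text fell back to ISO-8859-1 for this response, which matches the maintainers' explanation for text/* responses without a charset. The variable names are made up for illustration.

```python
# The three bytes quoted in the issue: the UTF-8 encoding of
# U+2019 RIGHT SINGLE QUOTATION MARK (the curly apostrophe).
raw = bytes([226, 128, 153])

# Decoding as UTF-8 yields the single intended character.
as_utf8 = raw.decode("utf-8")

# Decoding as ISO-8859-1 maps every byte to its own character,
# which produces the mojibake seen in response.text.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), repr(as_utf8))
print(len(as_latin1), repr(as_latin1))

# Text that was wrongly decoded as ISO-8859-1 can be repaired by
# re-encoding with the same codec and decoding as UTF-8.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")
print(repaired == as_utf8)
```

So .decode('utf-8') does no special heavy lifting; the two results differ only in which codec interprets the same bytes.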

Top GitHub Comments

sigmavirus24 commented, Sep 26, 2020

We can’t possibly document every possible use-case in an HTTP client. That said, I think it’s fair that for cases where we have defaults like this we add a table:

If the Content-Type of the response is one of these well-known types, we'll assume an encoding if none is provided:

================= ================
Content-Type      Default Encoding
================= ================
text/*            ISO-8859-1
application/json  UTF-8
================= ================
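
The table can be read as a lookup rule. The following is a rough sketch of that rule and not the actual requests implementation. The function name default_encoding is hypothetical, and the header parsing is deliberately simplified.

```python
from typing import Optional


def default_encoding(content_type: str) -> Optional[str]:
    """Hypothetical helper mirroring the table above.

    Returns the explicit charset if the header carries one, otherwise
    the assumed default for well-known types, otherwise None.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # An explicit charset parameter always wins over any default.
    for param in params.split(";"):
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset" and value.strip():
            return value.strip().strip("'\"")

    if media_type.startswith("text/"):
        return "ISO-8859-1"
    if media_type == "application/json":
        return "UTF-8"
    return None


print(default_encoding("text/xml"))                 # no charset: text/* default
print(default_encoding("text/xml; charset=utf-8"))  # explicit charset is used
print(default_encoding("application/json"))
```

Under this rule, a text/xml response without a charset parameter is decoded as ISO-8859-1 regardless of what the XML prolog declares.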

sethmlarson commented, Sep 25, 2020

Taken from RFC 6657 Section 3:

In accordance with option (a) above, registrations for “text/*” media types that can transport charset information inside the corresponding payloads (such as “text/html” and “text/xml”) SHOULD NOT specify the use of a “charset” parameter, nor any default value, in order to avoid conflicting interpretations should the “charset” parameter value and the value specified in the payload disagree.

Since text/xml defines its charset within the payload, we can’t assume a charset from an HTTP perspective. It is up to the requester (i.e., users of the client) to parse the XML prolog for the document's encoding. A potential route for you is:

1. Detect a Content-Type: text/xml
2. Pull the raw binary body of the response
3. Read and parse the prolog from the binary body (i.e., <?xml version = "1.0" encoding = "UTF-8" ...>)
4. Using the encoding parameter from the document, decode the binary body
5. Parse the rest of the XML body as normal
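
The steps above can be sketched as follows. This is an illustration under stated assumptions, not code from requests. The helper decode_xml_body and its fallback parameter are hypothetical. It assumes the declared encoding is ASCII-compatible and that the body has no byte order mark.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# start of the body, e.g. <?xml version="1.0" encoding="UTF-8"?>.
_XML_DECL_ENCODING = re.compile(
    rb"^\s*<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def decode_xml_body(body: bytes, fallback: str = "utf-8") -> str:
    """Decode raw XML bytes using the encoding named in the prolog.

    Falls back to `fallback` when no encoding declaration is found.
    """
    match = _XML_DECL_ENCODING.match(body[:200])
    encoding = match.group(1).decode("ascii") if match else fallback
    return body.decode(encoding)


# Stand-in for response.content from a text/xml response.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<name>Reptile\u2019s Tail</name>"
).encode("utf-8")

print(decode_xml_body(sample))
```

With requests, the raw bytes would come from response.content. If the goal is to parse the document and not only decode it, passing those bytes directly to xml.etree.ElementTree.fromstring also works, since the parser reads the encoding declaration itself. Another option is to set response.encoding = 'utf-8' before reading response.text.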

Also, somewhat unrelated, because we’re not doing the “wrong” thing here by guessing, but here’s the code that chooses ISO-8859-1 by default for any text/* response: https://github.com/psf/requests/blob/1ca1c52e698b13d3d5cc0755a6450306d880b933/requests/utils.py#L497
