response.text converting to utf-8 poorly

With response = requests.get('some_xml_api_endpoint'), the XML response has the header: <?xml version="1.0" encoding="UTF-8"?>. I’m using Ubuntu, Python 3, and requests 2.24.0, and saving the file as UTF-8 on an ext4 partition.

response.text gives a lot of broken characters. response.content.decode('utf-8') works well.

The response is rendered:

Reptileâs Tail (response.text broke the curly quote character)
Reptiles’s Tail (content.decode('utf-8') handled the curly quote well)

The character in question could be described as: print(bytes([226, 128, 153]).decode('utf-8'))
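
To make the difference concrete, here is a minimal sketch that decodes those same three bytes two ways. It assumes, based on the maintainer comments further down, that response.text fell back to ISO-8859-1 for this response; the variable names are illustrative only.

```python
# The three bytes from the report are the UTF-8 encoding of U+2019,
# the right single quotation mark (curly apostrophe).
raw = bytes([226, 128, 153])

# Decoding with the encoding the XML document declares.
as_utf8 = raw.decode("utf-8")

# Decoding with ISO-8859-1: each byte becomes its own character,
# so one curly quote turns into three characters, two of them invisible.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))

# The damage is reversible as long as the wrongly decoded text is intact.
repaired = as_latin1.encode("iso-8859-1").decode("utf-8")
```

The second print shows an "â" followed by two control characters, which matches the broken rendering above.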

I expected response.text to behave the same as response.content.decode('utf-8'). I’ve never read any warnings in the guides or docs to anticipate this behavior. Some prominent warnings would help.

Unless .decode('utf-8') is doing some magic heavy-lifting.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

sigmavirus24 commented, Sep 26, 2020

We can’t possibly document every possible use-case in an HTTP client. That said, I think it’s fair that for cases where we have defaults like this we add a table:

If the Content-Type of the response is one of these well-known types, we'll assume an encoding if none is provided:

================= ================
Content-Type      Default Encoding
================= ================
text/*            ISO-8859-1
application/json  UTF-8
================= ================
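
As an illustration of the table above, the lookup could be sketched roughly as follows. This is not the actual requests implementation; assumed_encoding is a hypothetical helper name, and the header parsing here leans on the standard library's email.message.Message.

```python
from email.message import Message
from typing import Optional


def assumed_encoding(content_type_header: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring the table: explicit charset wins,
    otherwise fall back to a default based on the media type."""
    if not content_type_header:
        return None

    # Let the stdlib parse "type/subtype; param=value" for us.
    msg = Message()
    msg["Content-Type"] = content_type_header

    explicit = msg.get_content_charset()  # lowercased, or None if absent
    if explicit:
        return explicit

    media_type = msg.get_content_type()
    if media_type.startswith("text/"):
        return "ISO-8859-1"
    if media_type == "application/json":
        return "UTF-8"
    return None  # no well-known default for this type


print(assumed_encoding("text/xml"))
print(assumed_encoding("text/xml; charset=UTF-8"))
```

Under these assumptions, a text/xml response without a charset parameter ends up with ISO-8859-1, which is the situation described in the original report.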
sethmlarson commented, Sep 25, 2020

Taken from RFC 6657 Section 3:

In accordance with option (a) above, registrations for “text/*” media types that can transport charset information inside the corresponding payloads (such as “text/html” and “text/xml”) SHOULD NOT specify the use of a “charset” parameter, nor any default value, in order to avoid conflicting interpretations should the “charset” parameter value and the value specified in the payload disagree.

Since text/xml defines its charset within the payload, we can’t assume a charset from an HTTP perspective. It is up to the requester (i.e., users of the client) to parse the XML prolog for the encoding of the document. A potential route for you is:

  • Detect a Content-Type: text/xml
  • Pull the raw binary body of the response
  • Read and parse the prolog from the binary body (i.e., <?xml version = "1.0" encoding = "UTF-8" ...>)
  • Using the encoding parameter from the document decode the binary body
  • Parse the rest of the XML body as normal
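
The steps above can be sketched roughly as follows, using only the standard library. The names sniff_xml_encoding and decode_xml_body are hypothetical, not part of requests. The sketch assumes the body uses an ASCII-compatible encoding (so not UTF-16) and falls back to UTF-8 when the prolog declares no encoding.

```python
import re

# Matches an XML declaration at the very start of the body and captures
# the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(raw: bytes, fallback: str = "utf-8") -> str:
    """Return the encoding named in the XML prolog, or the fallback."""
    match = _XML_DECL.match(raw[:200])  # the prolog sits at the start
    return match.group(1).decode("ascii") if match else fallback


def decode_xml_body(content_type: str, raw: bytes) -> str:
    """Decode a raw XML body using the encoding the document declares."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in ("text/xml", "application/xml"):
        raise ValueError(f"not an XML response: {content_type!r}")
    return raw.decode(sniff_xml_encoding(raw))


# Stand-in for response.headers['Content-Type'] and response.content.
body = '<?xml version="1.0" encoding="UTF-8"?><name>Reptile\u2019s Tail</name>'.encode("utf-8")
text = decode_xml_body("text/xml", body)
print(text)
```

With a real response, the two arguments would come from response.headers['Content-Type'] and response.content. For the final parsing step, Python's xml.etree.ElementTree also reads the declaration itself when handed the raw bytes, so passing response.content straight to the parser is another option.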

As a side note, since we’re not doing the “wrong” thing here by guessing, here’s the code that picks ISO-8859-1 for any text/* response by default: https://github.com/psf/requests/blob/1ca1c52e698b13d3d5cc0755a6450306d880b933/requests/utils.py#L497

