response.text converting to utf-8 poorly
With `response = requests.get('some_xml_api_endpoint')`, the XML response has the header `<?xml version="1.0" encoding="UTF-8"?>`.
I’m using Ubuntu, Python 3, requests 2.24.0, and saving that file as UTF-8 on an ext4 partition.
`response.text` gives a lot of broken characters. `response.content.decode('utf-8')` works well.
The response is rendered:

- `response.text`: Reptileâs Tail — broke the curly-quote character
- `response.content.decode('utf-8')`: Reptile’s Tail — handled the curly quote well
Could be described as: `print(bytes([226, 128, 153]).decode('utf-8'))`
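To illustrate what those three bytes turn into when decoded with the wrong codec, here is a plain-Python sketch (no requests involved):

```python
# U+2019 (the curly quote) is the three-byte sequence 0xE2 0x80 0x99 in UTF-8.
raw = bytes([226, 128, 153])

print(raw.decode('utf-8'))       # ’   (the intended character)
print(raw.decode('iso-8859-1'))  # 'â' plus two invisible C1 control characters
print(raw.decode('cp1252'))      # â€™ (the classic mojibake rendering)
```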
I expected `response.text` to behave the same as `response.content.decode('utf-8')`. I’ve never read any warnings in the guides or docs to anticipate this behavior. Some prominent warnings would help.
Unless `.decode('utf-8')` is doing some magic heavy lifting.
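For reference, here is a minimal sketch of the behavior and the usual workaround, assuming the server sends no charset in its Content-Type header (in which case requests falls back to ISO-8859-1 for `text/*` bodies):

```python
import requests

# 'some_xml_api_endpoint' is the placeholder URL from the report above.
response = requests.get('some_xml_api_endpoint')

# With no charset in the Content-Type header, requests guesses ISO-8859-1
# for text/* responses, which is what mangles UTF-8 curly quotes.
print(response.headers.get('Content-Type'))
print(response.encoding)           # e.g. 'ISO-8859-1'
print(response.apparent_encoding)  # best guess from the body bytes

# Tell requests what the body actually is before reading .text:
response.encoding = 'utf-8'
print(response.text)               # now matches response.content.decode('utf-8')
```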
We can’t possibly document every use case in an HTTP client. That said, I think it’s fair that for cases where we have defaults like this we add a table.
Per RFC 6657 Section 3: since `text/xml` defines its charset within the payload, we can’t make an assumption about the charset from an HTTP perspective. It is up to the requester (i.e., users of the client) to parse the XML prolog for the encoding of the document. A potential route for you is:
```
Content-Type: text/xml

<?xml version = "1.0" encoding = "UTF-8" ...>
```
Read the `encoding` parameter from the document’s prolog, and use it to decode the binary body.
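A rough sketch of that route, assuming a well-formed XML declaration at the very start of the body (the regex is a simplification; a real implementation would let an XML parser handle the declaration):

```python
import re
import requests

response = requests.get('some_xml_api_endpoint')

# Pull encoding="..." out of the XML declaration in the raw bytes.
# Simplified illustration only; it ignores BOMs and unusual whitespace.
match = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([\w.-]+)["\']', response.content)
doc_encoding = match.group(1).decode('ascii') if match else 'utf-8'

text = response.content.decode(doc_encoding)
```

Alternatively, handing the raw bytes straight to an XML parser (e.g. `xml.etree.ElementTree.fromstring(response.content)`) lets the parser honor the declared encoding itself.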
Also, unrelated because we’re not doing the “wrong” thing here by guessing, but here’s the code that decides ISO-8859-1 for any `text/*` responses by default: https://github.com/psf/requests/blob/1ca1c52e698b13d3d5cc0755a6450306d880b933/requests/utils.py#L497
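To see that default in action, the `get_encoding_from_headers` helper in that file can be called directly; a quick sketch, behavior as observed on requests 2.x:

```python
from requests.utils import get_encoding_from_headers

# No charset parameter: text/* falls back to ISO-8859-1.
print(get_encoding_from_headers({'content-type': 'text/xml'}))
# ISO-8859-1

# An explicit charset parameter is respected.
print(get_encoding_from_headers({'content-type': 'text/xml; charset=utf-8'}))
# utf-8
```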