HTTP 1 request headers decoded using default encoding instead of ISO-8859-1
Is there an existing issue for this?
- I have searched the existing issues
Describe the bug
Headers are decoded here without specifying their encoding:
On my system (macOS, Python 3.10.8 installed via Homebrew), this causes bytes that are valid characters in ISO-8859-1 but not valid UTF-8 to be decoded as surrogate escape characters; e.g. b"\x80" becomes "\udc80" instead of "\x80".
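The difference can be reproduced outside Sanic with plain Python. This is a minimal sketch, assuming the parser decodes with UTF-8 and the surrogateescape error handler, as the report describes; the variable names are only illustrative. Python's surrogateescape handler maps an undecodable byte 0xNN to the code point U+DCNN.

```python
# A single byte that is a valid ISO-8859-1 character but not valid UTF-8.
raw = b"\x80"

# Decoding as UTF-8 with surrogateescape (assumed to match what the report observes).
escaped = raw.decode("utf-8", errors="surrogateescape")

# Decoding as ISO-8859-1 maps every byte to the code point of the same value.
latin1 = raw.decode("iso-8859-1")

print(ascii(escaped))  # '\udc80'
print(ascii(latin1))   # '\x80'
```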
Code snippet
No response
Expected Behavior
Headers encoded as ISO-8859-1 with no MIME type should be decoded correctly, without UTF-8 surrogate escape characters.
How do you run Sanic?
As a script (app.run or Sanic.serve)
Operating System
linux
Sanic Version
22.9.1
Additional context
This used to work as expected in Sanic<=20.12.7.
Issue Analytics
- State:
- Created 10 months ago
- Reactions:1
- Comments:13 (11 by maintainers)
Top GitHub Comments
@relud After a discussion we are now planning to use ISO-8859-1 for request headers within the Sanic built-in server as well, to make it match ASGI and other frameworks that behave this way, also matching the behavior of older Sanic releases. It is noted that ISO-8859-1 can also be encoded back to original bytes, from which one can obtain UTF-8 or other decoding if needed.
As your bug report was apparently the first we’ve received on this, the issue is probably not affecting many at all, but at least this should make your implementation a bit easier. This is a breaking change, so no promises yet on when it will be released, even if everyone else is using ASCII headers and thus isn’t affected by it.
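The point about encoding ISO-8859-1 back to the original bytes can be sketched as follows. The helper name redecode_header is hypothetical and not part of Sanic; it only illustrates that an ISO-8859-1 decode is lossless, so a handler that knows a header really carries UTF-8 could recover it.

```python
def redecode_header(value: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: undo an ISO-8859-1 decode and decode the bytes again."""
    # ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto U+0000-U+00FF,
    # so encoding back yields exactly the bytes that were on the wire.
    raw = value.encode("iso-8859-1")
    return raw.decode(encoding)


# Example: a client sent UTF-8 bytes in a header value.
wire_bytes = "café".encode("utf-8")            # b'caf\xc3\xa9'
as_latin1 = wire_bytes.decode("iso-8859-1")    # what the server would hand to the app
recovered = redecode_header(as_latin1)         # the intended text again
print(ascii(as_latin1), ascii(recovered))
```

If the bytes are not valid in the requested encoding, the second decode raises UnicodeDecodeError, so a caller may want to catch that and fall back to the ISO-8859-1 string.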
Earlier Sanic versions handled headers as ISO-8859-1, which was causing trouble when they actually were in UTF-8 (more common nowadays). I had to put a lot of thought into this while reimplementing the HTTP parser code, as leaving them as bytes wouldn’t be practical either. The surrogate escape coding is WTF-8, which indeed is meant for preserving garbage, being able to restore original bytes of what might be ill-formed UTF-8. I’m glad you found use for this detail of Sanic’s implementation, being able to restore those bytes instead of simply showing “replacement character” as a naive implementation might.
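As a rough illustration of restoring the original bytes, the sketch below compares Python's surrogateescape error handler with the lossy replace handler. It assumes the header string was produced by a UTF-8 decode with surrogateescape; the sample bytes and variable names are made up for the example.

```python
# ISO-8859-1 bytes for "café"; the trailing 0xE9 is ill-formed as UTF-8.
raw = b"caf\xe9"

# Surrogate escapes keep the offending byte as a lone surrogate (U+DCE9).
escaped = raw.decode("utf-8", errors="surrogateescape")

# A naive decode substitutes U+FFFD and loses the original byte.
replaced = raw.decode("utf-8", errors="replace")

# Encoding with the same error handler restores the exact original bytes...
restored = escaped.encode("utf-8", errors="surrogateescape")

# ...which can then be decoded as ISO-8859-1 if that is what the client sent.
as_latin1 = restored.decode("iso-8859-1")

# The replaced string can only be encoded to the replacement character.
lossy = replaced.encode("utf-8")

print(ascii(escaped), restored, ascii(as_latin1), lossy)
```

Applied to a header value on an affected Sanic version, the same two-step conversion would be a possible workaround, provided the value has not been modified in between.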