[bug] cherrypy._cpreqbody.Part.read_headers incorrectly assumes the part headers are encoded in ISO-8859-1
cherrypy._cpreqbody.Part.read_headers incorrectly assumes the part headers are encoded in ISO-8859-1 (the encoding is hardcoded in the function).
This is a significant issue when uploading a file with a non-ASCII filename in a multipart/form-data payload. For example, on a website based on UTF-8, the filename “Paris en été” is incorrectly decoded as “Paris en Ã©tÃ©”.
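The mis-decoding can be reproduced outside CherryPy with plain string operations. This is only a minimal sketch of the effect; the variable names are illustrative and nothing here calls CherryPy itself:

```python
# -*- coding: utf-8 -*-
# What the browser sends on a UTF-8 page: the filename as UTF-8 bytes.
original = u"Paris en \u00e9t\u00e9"          # "Paris en été"
wire_bytes = original.encode("utf-8")         # b"Paris en \xc3\xa9t\xc3\xa9"

# What a decoder hardcoded to ISO-8859-1 produces from those bytes:
# every byte becomes one character, so each "é" turns into two characters.
misdecoded = wire_bytes.decode("iso-8859-1")  # u"Paris en Ã©tÃ©"

# Decoding with the encoding the client actually used recovers the name.
correct = wire_bytes.decode("utf-8")
```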
The HTML 5.2 specification states:
Encode the […] form data set using the rules described by RFC 7578, and return the resulting byte stream. […] File names included in the generated multipart/form-data resource (as part of file fields) must use the character encoding selected above, though the precise name may be approximated if necessary (e.g. newlines could be removed from file names, quotes could be changed to “%22”, and characters not expressible in the selected character encoding could be replaced by other characters). User agents must not use the RFC 2231 encoding suggested by RFC 2388.
Here is how the character encoding must be selected according to the same specification:
- If the algorithm was invoked with an explicit character encoding, let the selected character encoding be that encoding. (This algorithm is used by other specifications, which provide an explicit character encoding to avoid the dependency on the form element described in the next paragraph.)
- Otherwise, if the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.
- Otherwise, if the form element has no accept-charset attribute, but the document’s character encoding is an ASCII-compatible encoding, then that is the selected character encoding.
- Otherwise, let the selected character encoding be UTF-8.
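The four rules above can be sketched as a small function. This is an approximation under stated assumptions, not the browser algorithm itself: `select_form_encoding` and its parameters are hypothetical names, "picking an encoding for the form" is reduced to "first label Python's codec registry recognises, else UTF-8", and ASCII compatibility is checked against a short hand-written list.

```python
import codecs

# Assumption: a rough list of codecs that are not ASCII-compatible,
# written using the names Python's codec registry normalises to.
_NOT_ASCII_COMPATIBLE = {
    "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
    "utf-7",
}


def _is_known(label):
    """Return True if Python's codec registry recognises the label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def _is_ascii_compatible(label):
    """Approximate the spec's notion of an ASCII-compatible encoding."""
    return _is_known(label) and codecs.lookup(label).name not in _NOT_ASCII_COMPATIBLE


def select_form_encoding(explicit=None, accept_charset=None, document_encoding=None):
    """Hypothetical helper mirroring the four rules quoted above."""
    # Rule 1: an explicit encoding always wins.
    if explicit is not None:
        return explicit
    # Rule 2: accept-charset is a space-separated list of labels;
    # simplified here to "first recognised label, else UTF-8".
    if accept_charset is not None:
        for label in accept_charset.split():
            if _is_known(label):
                return label
        return "utf-8"
    # Rule 3: fall back to the document encoding if it is ASCII-compatible.
    if document_encoding is not None and _is_ascii_compatible(document_encoding):
        return document_encoding
    # Rule 4: otherwise UTF-8.
    return "utf-8"
```

The point for the server side is that none of these inputs are transmitted with the part headers, so the server cannot reliably recompute the selected encoding; it has to guess or be configured.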
RFC 7578 states:
Some commonly deployed systems use multipart/form-data with file names directly encoded including octets outside the US-ASCII range. The encoding used for the file names is typically UTF-8, although HTML forms will use the charset associated with the form.
A temporary workaround in your handler is to reencode the filename received from CherryPy to ISO-8859-1 and decode it again using UTF-8:
my_file_param.filename.encode('iso-8859-1').decode('utf-8', 'replace')
The ‘replace’ argument is an application of Postel’s principle of robustness: “be conservative in what you do, be liberal in what you accept from others”.
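For use in more than one handler, the workaround can be wrapped in a helper. This is only a sketch: `fix_filename` is a hypothetical name, and it assumes the client really sent UTF-8, which the server cannot verify.

```python
def fix_filename(filename):
    """Undo an ISO-8859-1 decode of bytes that were really UTF-8.

    Hypothetical helper, not part of CherryPy.
    """
    try:
        # Recover the raw bytes that were decoded as ISO-8859-1.
        raw = filename.encode("iso-8859-1")
    except UnicodeEncodeError:
        # The string contains characters outside ISO-8859-1, so it was
        # not produced by an ISO-8859-1 decode; leave it untouched.
        return filename
    # Re-decode as UTF-8; invalid sequences become U+FFFD instead of raising.
    return raw.decode("utf-8", "replace")
```

In a handler this would be called as `fix_filename(my_file_param.filename)`. Note the cost of assuming UTF-8: a filename that was genuinely sent as ISO-8859-1 and contains non-ASCII characters comes back with U+FFFD replacement characters.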
In the past, I submitted a pull request to try to fix this, but I failed to make it backward-compatible. I’m not submitting a new PR right now, because after having read the related code in Django and Flask/Werkzeug, I’m not sure it’s possible to fix this without breaking backward-compatibility in some edge cases.
CherryPy version: 17.0.0
Python version: 2.7
Top GitHub Comments
After looking at the previous PR, it looks like less is more here. We might want to just swap the instances of US-ASCII to UTF-8 instead.
Agreed.
Agreed.
I was thinking of a user-supplied file name encoded in ISO-8859-1 in the Content-Disposition header, and which would be mistakenly decoded as UTF-8.
Currently, CherryPy decodes filenames in the Content-Disposition header as ISO-8859-1. The idea we’re floating here is to try UTF-8 first and fall back to ISO-8859-1. As you wrote earlier, it should be backward compatible in most cases. I just want to point out that for some very weird filenames, this could change CherryPy behavior.
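A minimal sketch of the "UTF-8 first, ISO-8859-1 fallback" idea, to make the edge case concrete. `decode_header_value` is a hypothetical name, not the actual CherryPy code or the content of any patch:

```python
def decode_header_value(raw):
    """Decode raw header bytes: try UTF-8, fall back to ISO-8859-1.

    Hypothetical sketch of the proposal discussed in this thread.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        return raw.decode("iso-8859-1")


# UTF-8 filename: now decoded correctly.
utf8_name = decode_header_value(b"Paris en \xc3\xa9t\xc3\xa9")

# Typical ISO-8859-1 filename: not valid UTF-8, so the fallback
# yields the same result as today.
latin1_name = decode_header_value(b"\xe9t\xe9")

# The weird case: bytes that are valid in both encodings. A client that
# really meant the ISO-8859-1 text "Ã©" now gets "é" instead.
ambiguous = decode_header_value(b"\xc3\xa9")
```

Only ISO-8859-1 filenames whose non-ASCII bytes happen to form valid UTF-8 sequences change behavior, which is why this is backward compatible in most cases but not all.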
Agreed.
I was thinking about something like cherrypy.request.app.encoding, but it’s a bad idea 😃 Something like cherrypy.request.encoding should be preferred.