[bug] cherrypy._cpreqbody.Part.read_headers incorrectly assumes the part headers are encoded in ISO-8859-1

cherrypy._cpreqbody.Part.read_headers incorrectly assumes that part headers are encoded in ISO-8859-1 (the encoding is hardcoded in the function):

https://github.com/cherrypy/cherrypy/blob/2d1b3c6120f7776918d9b67c25baf2e45e4b3bbd/cherrypy/_cpreqbody.py#L641

This is a significant issue when uploading a file with a non-ASCII filename in a multipart/form-data payload. For example, on a website based on UTF-8, the filename “Paris en été” is incorrectly decoded as “Paris en Ã©tÃ©”.
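The garbling can be reproduced in a few lines of Python 3, independently of CherryPy. This is a minimal sketch, assuming the browser sends the file name as UTF-8 bytes and the server decodes them as ISO-8859-1:

```python
# What a browser on a UTF-8 page would send for the file name.
original = "Paris en été"
wire_bytes = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 turns each two-byte UTF-8 sequence
# into two separate Latin-1 characters.
misdecoded = wire_bytes.decode("iso-8859-1")

print(misdecoded)  # Paris en Ã©tÃ©
```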

The HTML 5.2 specification states:

Encode the […] form data set using the rules described by RFC 7578, and return the resulting byte stream. […] File names included in the generated multipart/form-data resource (as part of file fields) must use the character encoding selected above, though the precise name may be approximated if necessary (e.g. newlines could be removed from file names, quotes could be changed to “%22”, and characters not expressible in the selected character encoding could be replaced by other characters). User agents must not use the RFC 2231 encoding suggested by RFC 2388.

Here is how the character encoding must be selected according to the same specification:

If the algorithm was invoked with an explicit character encoding, let the selected character encoding be that encoding. (This algorithm is used by other specifications, which provide an explicit character encoding to avoid the dependency on the form element described in the next paragraph.)

Otherwise, if the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.

Otherwise, if the form element has no accept-charset attribute, but the document’s character encoding is an ASCII-compatible encoding, then that is the selected character encoding.

Otherwise, let the selected character encoding be UTF-8.
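The selection rules quoted above can be sketched as a small Python function. This is an illustration only: the function name, its parameters, and the ASCII_COMPATIBLE set are hypothetical, and the step "picking an encoding for the form" is simplified to taking the first listed charset.

```python
from typing import Optional, Sequence

# Hypothetical, incomplete stand-in for the specification's notion of an
# "ASCII-compatible encoding".
ASCII_COMPATIBLE = {"utf-8", "iso-8859-1", "windows-1252", "us-ascii"}


def select_form_encoding(
    explicit: Optional[str] = None,
    accept_charset: Optional[Sequence[str]] = None,
    document_encoding: Optional[str] = None,
) -> str:
    """Return the encoding a user agent would use for the form data."""
    # 1. An explicit encoding passed to the algorithm always wins.
    if explicit:
        return explicit.lower()
    # 2. Otherwise use the form's accept-charset attribute.
    #    Simplification: take the first entry instead of running the
    #    specification's full "picking an encoding" algorithm.
    if accept_charset:
        return accept_charset[0].lower()
    # 3. Otherwise use the document encoding if it is ASCII-compatible.
    if document_encoding and document_encoding.lower() in ASCII_COMPATIBLE:
        return document_encoding.lower()
    # 4. Otherwise fall back to UTF-8.
    return "utf-8"
```

In none of these branches is ISO-8859-1 a fixed default: it is selected only when the form or the document asks for it.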

RFC 7578 states:

Some commonly deployed systems use multipart/form-data with file names directly encoded including octets outside the US-ASCII range. The encoding used for the file names is typically UTF-8, although HTML forms will use the charset associated with the form.

A temporary workaround in your handler is to re-encode the filename received from CherryPy as ISO-8859-1 and decode it again as UTF-8:

my_file_param.filename.encode('iso-8859-1').decode('utf-8', 'replace')

The 'replace' argument applies Postel’s robustness principle, “be conservative in what you do, be liberal in what you accept from others”: bytes that are not valid UTF-8 are replaced with U+FFFD instead of raising an error.
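The workaround can be wrapped in a small helper. This is a sketch for Python 3 (the issue was reported on Python 2.7), and repair_filename is a hypothetical name, not part of CherryPy:

```python
def repair_filename(filename: str) -> str:
    """Undo an ISO-8859-1 decode of bytes that were really UTF-8."""
    # ISO-8859-1 maps bytes 0-255 to code points 0-255, so encoding
    # recovers exactly the bytes that were received on the wire.
    raw = filename.encode("iso-8859-1")
    # 'replace' substitutes U+FFFD for invalid sequences instead of raising.
    return raw.decode("utf-8", "replace")


print(repair_filename("Paris en Ã©tÃ©"))  # Paris en été
```

The trade-off is that a name that really was sent as ISO-8859-1, such as "été", comes back with its accented characters replaced by U+FFFD.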

In the past, I submitted a pull request to fix this, but I failed to make it backward-compatible. I’m not submitting a new PR right now because, after reading the related code in Django and Flask/Werkzeug, I’m not sure it’s possible to fix this without breaking backward compatibility in some edge cases.

CherryPy version: 17.0.0
Python version: 2.7

Issue Analytics

Created 5 years ago
Comments: 8 (2 by maintainers)

Top GitHub Comments

ian-otto commented, Jan 28, 2019

After looking at the previous PR, it looks like less is more here. We might want to just swap the instances of US-ASCII to UTF-8 instead.

ngrilly commented, Jan 31, 2019

Technically, any utf-8 bytestring is valid iso-8859-1.

Agreed.

This issue of header decoding matters mainly, if not only, for Content-Disposition, because it contains a user-supplied file name.

Agreed.

Actual user-supplied file names that are valid utf-8 are never intended to be iso-8859-1. Practicality beats purity.

I was thinking of a user-supplied file name encoded in ISO-8859-1 in the Content-Disposition header, and which would be mistakenly decoded as UTF-8.

Currently, CherryPy decodes filenames in the Content-Disposition header as ISO-8859-1. The idea we’re floating here is to try UTF-8 first and fall back to ISO-8859-1. As you wrote earlier, this should be backward compatible in most cases. I just want to point out that for some very weird filenames, it could change CherryPy’s behavior.
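A minimal Python 3 sketch of that idea follows. It is not CherryPy's actual code, and decode_header_value is a hypothetical name. It also shows the kind of weird filename that would behave differently:

```python
def decode_header_value(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 accepts every byte value, so this cannot fail.
        return raw.decode("iso-8859-1")


utf8_name = "Paris en été".encode("utf-8")
latin1_name = "Paris en été".encode("iso-8859-1")
# Edge case: a name sent as ISO-8859-1 whose bytes happen to be valid UTF-8.
edge_case = "Ã©.txt".encode("iso-8859-1")

print(decode_header_value(utf8_name))    # Paris en été
print(decode_header_value(latin1_name))  # Paris en été
print(decode_header_value(edge_case))    # é.txt
```

The third call is the behavior change: with the current ISO-8859-1 decoding, the handler would receive "Ã©.txt" for those bytes.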

The expected encoding should be set in the request, where it can be changed in the request config (which is endpoint dependent and loaded by the dispatcher before headers are processed).

Agreed.

I am not sure what you mean by setting it in the app, as opposed to in the request.

I was thinking about something like cherrypy.request.app.encoding, but it’s a bad idea 😃 Something like cherrypy.request.encoding should be preferred.