Wrong response.body encoding with http-equiv headers

A Response object doesn’t seem to obey an http-equiv Content-Type encoding declaration when an HTTP header says something different. If the HTTP header says ‘utf-8’ but the body is encoded as, say, code page 1252 and the document’s http-equiv declaration says 1252, Scrapy still appears to pick utf-8 for decoding the body.

That might be the right decision, but I think it’s wrong. The document itself should know its encoding better than a server-wide setting would.
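
The mismatch described above can be reproduced without Scrapy. The following is a minimal standard-library sketch, not Scrapy's own detection logic: the sample body, the `header_encoding` value, and the helper `declared_in_body` are all made up for illustration, and the regular expression is a deliberately simplified way of reading the meta declaration.

```python
import re

# "café" encoded as Windows-1252: the é is the single byte 0xE9.
body = (
    b"<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
    b"</head><body>caf\xe9</body></html>"
)

# What the (assumed misconfigured) server claims in its Content-Type header.
header_encoding = "utf-8"


def declared_in_body(raw: bytes):
    """Return the charset named in the document itself, or None (simplified)."""
    match = re.search(rb"""charset=["']?([\w-]+)""", raw, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


body_encoding = declared_in_body(body)

# Decoding with the header's encoding cannot interpret the byte 0xE9,
# so it is replaced with U+FFFD.
as_header = body.decode(header_encoding, errors="replace")

# Decoding with the document's own declaration recovers the text.
as_body = body.decode(body_encoding)
```

Here `as_body` contains the intended word, while `as_header` contains a replacement character where the `é` should be.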

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 8 (6 by maintainers)

Top GitHub Comments

1 reaction
Gallaecio commented, Mar 4, 2021

Instead of creating a pull request, I think you could just share the solution here, as people will have to copy-paste it anyway.

1 reaction
majate commented, Mar 4, 2021

Thank you for the information and your ideas! Sadly, we have quite a short time frame for our assignment, so we could not change our approach after we had started. As of right now, we have implemented a new downloader middleware that is only included in the middleware pipeline if it is enabled in the settings. The responses are processed as follows:

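from scrapy.http import TextResponse

# Method of the downloader middleware class.
def process_response(self, request, response, spider):
    # Only act on text responses that were not given an explicit encoding.
    if isinstance(response, TextResponse) and response._encoding is None:
        # Note: these are private TextResponse helpers.
        header_encoding = response._headers_encoding()
        body_encoding = response._body_declared_encoding()
        # Prefer the body's declaration when both exist and disagree.
        if (header_encoding is not None
                and body_encoding is not None
                and header_encoding != body_encoding):
            return response.replace(encoding=body_encoding)

    # Otherwise pass the response through unchanged.
    return response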

This updates the response to obey the encoding declared in the body over the encoding declared in the header, while an encoding passed through the encoding argument of the __init__ method keeps priority over both. Our solution works according to our tests, but there might be some edge cases that we have forgotten to test.
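
The resulting priority order can be written down as a small pure function. This is only a sketch of the precedence described in this comment, not Scrapy code: the function name `pick_encoding` and its parameters are hypothetical, and it leaves out the fallback Scrapy uses when nothing is declared at all.

```python
from typing import Optional


def pick_encoding(
    explicit: Optional[str],
    header: Optional[str],
    body_declared: Optional[str],
) -> Optional[str]:
    """Return the encoding to use under the precedence described above."""
    # 1. An encoding passed explicitly always wins.
    if explicit is not None:
        return explicit
    # 2. Otherwise the document's own declaration beats the HTTP header.
    if body_declared is not None:
        return body_declared
    # 3. Fall back to the header; this may be None if nothing was declared.
    return header
```

For example, `pick_encoding(None, "utf-8", "cp1252")` returns `"cp1252"`, while `pick_encoding("latin-1", "utf-8", "cp1252")` returns `"latin-1"`.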

We realise, as you have already mentioned, that it might be better to implement the feature directly in the TextResponse class, instead of needing to process every Response in a middleware. However, this is definitely a quick and easy solution for someone wanting to achieve this behaviour before a better solution is implemented and merged.
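
For anyone who wants to try the middleware approach, registering it might look like the sketch below. `DOWNLOADER_MIDDLEWARES` is Scrapy's standard setting for this, but the import path and class name are hypothetical (the thread does not show them), and the order value is only an illustration.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical import path and class name; point this at wherever
    # the class containing process_response actually lives.
    "myproject.middlewares.BodyDeclaredEncodingMiddleware": 543,  # illustrative order
}
```

Leaving the entry out of the setting keeps the middleware out of the pipeline entirely.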

Would it be useful for anyone if we created a pull request (just to show our full solution) or would that be unnecessary?
