Wrong type(response) for binary responses
Description
Binary (file) responses are identified as TextResponse instead of a plain Response in spider.parse.
Steps to Reproduce
From this URL https://www.ecb.europa.eu/press/pr/date/2004/html/pr040702.en.html
We can get this link https://www.ecb.europa.eu/pub/redirect/pub_5874_en.html (link text is "pdf 1692 kB").
It redirects to https://www.ecb.europa.eu/pub/pdf/other/developmentstatisticsemu200406en.pdf
Expected behavior:
In the spider, isinstance(response, TextResponse) should be False.
Actual behavior:
In the spider, isinstance(response, TextResponse) is True, even though the Content-Type header is application/pdf.
Versions
Scrapy : 1.7.3
lxml : 4.4.2.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : Linux-4.15.0-72-generic-x86_64-with-Ubuntu-18.04-bionic
Extra info
Probably we should handle known binary types in Content-Type here: https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py
At least application/pdf should be in the list.
It would be nice to add some mechanism that lets developers extend the mapping on the fly, since new types can come up on a per-project basis and it would then be easier to update the desired behavior.
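To make the request above concrete, here is a minimal, standard-library-only sketch of what an extensible Content-Type mapping could look like. It is not Scrapy's API: the `Response` and `TextResponse` classes below are hypothetical stand-ins, and `ResponseTypeRegistry`, `register` and `from_content_type` are names invented for this illustration.

```python
from typing import Dict, Type


class Response:
    """Hypothetical stand-in for a plain (binary) response class."""


class TextResponse(Response):
    """Hypothetical stand-in for a text response class."""


class ResponseTypeRegistry:
    """Maps MIME types to response classes and can be extended at runtime."""

    def __init__(self) -> None:
        # Assumed defaults for this sketch: text is textual, PDF is binary.
        self._classes: Dict[str, Type[Response]] = {
            "text/*": TextResponse,
            "application/pdf": Response,
        }

    def register(self, mimetype: str, cls: Type[Response]) -> None:
        """Add or override the class used for a MIME type."""
        self._classes[mimetype.lower()] = cls

    def from_content_type(self, header_value: str) -> Type[Response]:
        """Pick a class from a Content-Type header value."""
        # Drop parameters such as "; charset=utf-8" and normalise the case.
        mimetype = header_value.split(";", 1)[0].strip().lower()
        if mimetype in self._classes:
            return self._classes[mimetype]
        # Fall back to a "maintype/*" wildcard, then to the plain Response.
        maintype = mimetype.split("/", 1)[0]
        return self._classes.get(maintype + "/*", Response)


registry = ResponseTypeRegistry()
# A project could register its own types without patching the library.
registry.register("application/x-project-report", TextResponse)
```

With a registry like this, an unknown `application/...` type falls back to the plain class, and a project only needs one `register` call to change that for its own types.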
Issue Analytics
- State:
- Created 4 years ago
- Reactions: 4
- Comments: 28 (23 by maintainers)
Top GitHub Comments
In this case I guess you are free to choose the target project. Just make sure you pick one that is actively maintained, i.e. one with recent commits.
I think we can keep that discussion to this issue, as it’s quite relevant to the issue.
And I disagree on that point with @kmike , but that’s my personal point of view. I’m not who you need to impress if you’re a gsoc student 😄 .
To my understanding, mime sniffing has two problem domains. The first is header sniffing of a passed MIME type (and evaluating the “no-sniff” X-Content-Type-Options header etc.): these can’t be done by file/magic. This is the logic to decide if content sniffing is needed at all, or if we go by headers or file-endings and stuff.
And then the other is content body sniffing. These are all the byte pattern and -mask things listed in that mimesniff spec page. These would be easily implemented using file magic grammar, from what I can see, and you’d be hard-pressed to reinvent a similarly extensible parser in a gsoc project. “mimesniff” stops after speccing mp4, webm and mp3 media type parsers, but file has pattern magic to detect some hundreds of movie format variations alone, out of the box. Wrapping libmagic for scrapy could give users here a much more customizable framework solution later than a rigid custom parser that implements the bare minimum given by “mimesniff”.
Just a single example. Let’s use GIF. That whatwg spec page lists it like this:
Here is how you would write a libmagic implementation fitting that description:
That’s it, pretty much copied from the existing code. It says: start at byte 0 (don’t ignore any leading bytes), then scan for the string “GIF8”. If that matches (go to the next line), skip (those) 4 bytes and then look for string “7a”, or “9a”, and print that as the gif version number. If either of these matches, return the mime type as “image/gif”. If that looks too easy, you can tell magic to look for “47 49 46 38” as big endian byte pattern instead (now it looks more like the mime-sniff spec thing):
And libmagic lets you scan for all kinds of little, big or native endian bytestrings, floating point numbers, and more. Or let’s see how libmagic detects webm by default, which requires 40 lines of parser steps on the mimesniff page:
I won’t explain those 4 lines, but they do pretty much what the whatwg parser steps ask for, if not as rigidly, except for the bigger search buffer of 4096 here versus 38-4 bytes in the “spec”.
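For readers who want to see the byte-pattern-and-mask idea from this comment in runnable form, here is a small sketch in plain Python. It is neither libmagic nor Scrapy code: `SIGNATURES`, `matches` and `sniff` are names made up for this illustration, and the table only holds the two GIF signatures and the `%PDF-` signature as examples.

```python
from typing import List, Optional, Tuple

# (pattern, mask, MIME type). A mask byte of 0xFF means "compare this byte
# exactly". This table is illustrative and far from complete.
SIGNATURES: List[Tuple[bytes, bytes, str]] = [
    (bytes.fromhex("474946383761"), b"\xff" * 6, "image/gif"),      # "GIF87a"
    (bytes.fromhex("474946383961"), b"\xff" * 6, "image/gif"),      # "GIF89a"
    (bytes.fromhex("255044462d"), b"\xff" * 5, "application/pdf"),  # "%PDF-"
]


def matches(data: bytes, pattern: bytes, mask: bytes) -> bool:
    """Return True if the start of data equals pattern after masking."""
    if len(data) < len(pattern):
        return False
    # zip() stops at the end of the pattern, so only leading bytes are checked.
    return all((d & m) == p for d, m, p in zip(data, mask, pattern))


def sniff(body: bytes) -> Optional[str]:
    """Return the MIME type of the first matching signature, or None."""
    for pattern, mask, mimetype in SIGNATURES:
        if matches(body, pattern, mask):
            return mimetype
    return None
```

A table like this is simple to write but has to be maintained entry by entry, which is the trade-off the comment raises against reusing the much larger pattern database that ships with file.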