Wrong type(response) for binary responses


Description

Binary (file) responses are identified as TextResponse instead of a plain Response in spider.parse.

Steps to Reproduce

The page https://www.ecb.europa.eu/press/pr/date/2004/html/pr040702.en.html contains the link https://www.ecb.europa.eu/pub/redirect/pub_5874_en.html (link text: "pdf 1692 kB"), which redirects to https://www.ecb.europa.eu/pub/pdf/other/developmentstatisticsemu200406en.pdf

Expected behavior:

In the spider, isinstance(response, TextResponse) should be False.

Actual behavior:

In the spider, isinstance(response, TextResponse) is True, even though the Content-Type header is application/pdf.

Versions

Scrapy       : 1.7.3
lxml         : 4.4.2.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.6.8 (default, Jan 14 2019, 11:02:34) - [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : Linux-4.15.0-72-generic-x86_64-with-Ubuntu-18.04-bionic

Extra info

We should probably handle known binary Content-Type values in scrapy/responsetypes.py (https://github.com/scrapy/scrapy/blob/master/scrapy/responsetypes.py). At least application/pdf should be in the list. It would also be nice to have a mechanism that lets developers extend the mapping on the fly, since new types come up on a per-project basis and this would make it easier to update the desired behavior.
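
A rough sketch of what such an extensible mapping could look like is shown below. This is not Scrapy's actual code: `ResponseTypeRegistry`, `register` and `from_content_type` are hypothetical names invented for illustration, and the `Response` and `TextResponse` classes are empty stand-ins for the real ones.

```python
class Response:
    """Stand-in for a plain (binary) response; not Scrapy's real class."""


class TextResponse(Response):
    """Stand-in for a text response; not Scrapy's real class."""


class ResponseTypeRegistry:
    """Hypothetical registry mapping MIME types to response classes."""

    def __init__(self):
        # Exact MIME types take priority over top-level type fallbacks.
        self._by_mimetype = {
            "application/pdf": Response,
            "text/html": TextResponse,
        }
        self._by_maintype = {"text": TextResponse}

    def register(self, mimetype, response_cls):
        """Let a project add or override a mapping at runtime."""
        self._by_mimetype[mimetype.lower()] = response_cls

    def from_content_type(self, content_type):
        """Pick a response class from a raw Content-Type header value."""
        # Drop parameters such as "; charset=utf-8" and normalise case.
        mimetype = content_type.split(";", 1)[0].strip().lower()
        if mimetype in self._by_mimetype:
            return self._by_mimetype[mimetype]
        maintype = mimetype.split("/", 1)[0]
        # Unknown types fall back to the plain Response.
        return self._by_maintype.get(maintype, Response)


registry = ResponseTypeRegistry()
# A project-specific type (made up for this example) added on the fly.
registry.register("application/x-custom-text", TextResponse)
```

With a registry like this, a project would call `register` once at startup, and anything unrecognised would default to the binary-safe `Response` rather than `TextResponse`.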

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 28 (23 by maintainers)

Top GitHub Comments

2 reactions
Gallaecio commented, Mar 3, 2021

So I must send PR to any repo of your organization or specific one in this case?

In this case I guess you are free to choose the target project. Just make sure you pick one that is actively maintained, i.e. one with recent commits.

Can I create a new separate issue about this new lib here to communicate with you and regarding lib details, and regarding gsoc details at all?

I think we can keep that discussion to this issue, as it’s quite relevant to the issue.

2 reactions
nyov commented, Mar 21, 2020

And I disagree with @kmike on that point, but that’s my personal point of view. I’m not the one you need to impress if you’re a GSoC student 😄.

To my understanding, MIME sniffing has two problem domains. The first is header sniffing of a passed MIME type (evaluating the “no-sniff” X-Content-Type-Options header, etc.), which can’t be done by file/magic. This is the logic that decides whether content sniffing is needed at all, or whether we go by headers, file extensions, and so on.
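
As a rough illustration of that header-level decision, here is a minimal Python sketch. The function name `needs_body_sniffing` is made up, and the rules are a deliberate simplification rather than the full mimesniff algorithm: it trusts the declared type when `nosniff` is set, and otherwise sniffs only when the declared type is missing or one of the generic placeholders.

```python
# Declared types that carry no usable information (simplified assumption).
UNINFORMATIVE_TYPES = {"", "unknown/unknown", "application/unknown", "*/*"}


def needs_body_sniffing(headers):
    """Decide from headers alone whether the body should be inspected.

    `headers` is a dict of lower-cased header names to string values.
    """
    # An explicit "nosniff" tells the client to trust the declared type.
    if headers.get("x-content-type-options", "").strip().lower() == "nosniff":
        return False
    content_type = headers.get("content-type", "")
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mimetype = content_type.split(";", 1)[0].strip().lower()
    # Missing or generic types tell us nothing, so fall back to the body.
    return mimetype in UNINFORMATIVE_TYPES
```

None of this needs to look at the body bytes, which is why it sits outside what file/magic can do.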

The other is content body sniffing: all the byte-pattern and pattern-mask rules listed on that mimesniff spec page. From what I can see, these would be easy to implement using file’s magic grammar, and you’d be hard-pressed to reinvent a similarly extensible parser in a GSoC project. “mimesniff” stops after speccing MP4, WebM and MP3 media type parsers, but file ships with magic patterns that detect hundreds of movie format variations alone, out of the box. Wrapping libmagic for Scrapy could later give users a much more customizable framework solution than a rigid custom parser that implements only the bare minimum given by “mimesniff”.

Just a single example. Let’s use GIF. That whatwg spec page lists it like this:

Byte Pattern        Pattern Mask        Leading Bytes to Be Ignored   Image MIME Type   Note
47 49 46 38 37 61   FF FF FF FF FF FF   None.                         image/gif         The string “GIF87a”, a GIF signature.
47 49 46 38 39 61   FF FF FF FF FF FF   None.                         image/gif         The string “GIF89a”, a GIF signature.

Here is how you would write a libmagic implementation fitting that description:

# GIF
0       string          GIF8            GIF image data
>4      string          7a              \b, version 8%s
!:mime  image/gif
>4      string          9a              \b, version 8%s
!:mime  image/gif

That’s it, pretty much copied from the existing code. It says: start at byte 0 (don’t ignore any leading bytes) and look for the string “GIF8”. If that matches, continue with the next-level lines: at offset 4 (just past those 4 bytes), look for the string “7a” or “9a” and print it as part of the GIF version number. If either matches, return the MIME type “image/gif”. If that looks too easy, you can tell magic to look for “47 49 46 38” as a big-endian byte pattern instead (now it looks more like the mimesniff spec):

0       belong          0x47494638       GIF image data
>4      beshort         0x3961           \b, version 89a 
!:mime  image/gif
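
For comparison, here is a minimal Python sketch of the two approaches. It is neither libmagic nor a full mimesniff implementation, and the function names are made up for illustration. `sniff_with_masks` applies the pattern-and-mask rows from the GIF table, while `sniff_big_endian` mirrors the `belong`/`beshort` rule by comparing big-endian integers.

```python
import struct

# (pattern, mask, MIME type) rows copied from the GIF table above.
GIF_SIGNATURES = [
    (bytes.fromhex("474946383761"), bytes.fromhex("FFFFFFFFFFFF"), "image/gif"),
    (bytes.fromhex("474946383961"), bytes.fromhex("FFFFFFFFFFFF"), "image/gif"),
]


def matches(data, pattern, mask):
    """Return True if the masked prefix of `data` equals `pattern`."""
    if len(data) < len(pattern):
        return False
    return all((d & m) == p for d, m, p in zip(data, mask, pattern))


def sniff_with_masks(data):
    """Try each table row in order; return the MIME type or None."""
    for pattern, mask, mimetype in GIF_SIGNATURES:
        if matches(data, pattern, mask):
            return mimetype
    return None


def sniff_big_endian(data):
    """Mirror the belong/beshort rule: compare integers, not strings."""
    if len(data) < 6:
        return None
    # ">IH" reads a big-endian 32-bit value followed by a 16-bit value.
    magic, version = struct.unpack_from(">IH", data, 0)
    if magic == 0x47494638 and version == 0x3961:
        return "image/gif"
    return None
```

Like the second magic rule, `sniff_big_endian` only recognises the 89a variant. Each additional format means more hand-written Python here, whereas in the magic grammar it is a few more lines of data.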

And libmagic lets you scan for all kinds of little-, big-, or native-endian byte strings, floating-point numbers, and more. Or let’s see how libmagic detects WebM by default, which requires 40 lines of parser steps on the mimesniff page:

0               belong          0x1a45dfa3
>4              search/4096     \x42\x82
>>&1            string          webm            WebM
!:mime  video/webm

I won’t explain those 4 lines, but they do pretty much what the whatwg parser steps ask for, if not as rigidly, except for the bigger search buffer of 4096 here versus 38-4 bytes in the “spec”.
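
For readers who want a feel for what those four lines do, here is an approximate Python rendering. It is a sketch under assumptions: `looks_like_webm` is a made-up name, and the exact bounds of libmagic's `search/4096` may differ slightly from the `bytes.find` window used here.

```python
EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # the 32-bit value from the `belong` line
MARKER = b"\x42\x82"              # the two-byte pattern from the `search` line


def looks_like_webm(data, search_range=4096):
    """Approximate the four-line magic rule for video/webm."""
    # Line 1: the data must start with the big-endian magic number.
    if not data.startswith(EBML_MAGIC):
        return False
    # Line 2: starting at offset 4, search a bounded window for the marker.
    found = data.find(MARKER, 4, 4 + search_range)
    if found == -1:
        return False
    # Line 3: skip one byte past the end of the match, then expect "webm".
    start = found + len(MARKER) + 1
    # Line 4 (`!:mime`) corresponds to reporting video/webm on success.
    return data[start:start + 4] == b"webm"
```

For example, `looks_like_webm(b"\x1a\x45\xdf\xa3\x9f\x42\x82\x84webm")` passes all three checks, while the same header followed by a `matroska` string fails the last one.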
