SplashMiddleware breaks "script": 1 invocations
See: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py#L324
This line:
body = json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)
breaks any SplashRequest that is trying to emulate the following:
# Render page and execute simple Javascript function, display the js output
curl -X POST -H 'content-type: application/javascript' \
-d 'function getAd(x){ return x; } getAd("abc");' \
'http://localhost:8050/render.json?url=http://domain.com&script=1'
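For comparison, the same call can be sketched in Python with only the standard library. This is a minimal illustration of the request shape, not part of scrapy-splash: `SPLASH_URL`, `JS_SOURCE` and `build_script_request` are names made up here, and the request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values mirroring the curl example above; nothing is sent.
SPLASH_URL = "http://localhost:8050/render.json"
JS_SOURCE = 'function getAd(x){ return x; } getAd("abc");'


def build_script_request(target_url: str, js_source: str) -> Request:
    """Build (but do not send) a POST whose body is the raw JavaScript."""
    # The Splash arguments travel in the query string, not in the body.
    query = urlencode({"url": target_url, "script": 1})
    return Request(
        f"{SPLASH_URL}?{query}",
        data=js_source.encode("utf-8"),
        headers={"Content-Type": "application/javascript"},
        method="POST",
    )


request = build_script_request("http://domain.com", JS_SOURCE)
print(request.get_method(), request.full_url)
print(request.data)
```

The point is that the body is the script itself and the arguments sit in the URL, which is the combination the middleware does not let me produce.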
If you just steamroll the intended POST body with a JSON dump of the args, then it’s basically impossible to structure a render.json request that sends raw JavaScript as its body (at least without using Lua, from what I can see).
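A small standalone sketch of the effect, using only the `json.dumps` call quoted above. The `args` values are illustrative, not taken from a real crawl.

```python
import json

# Illustrative values only; JS_SOURCE stands in for the script from the curl example.
JS_SOURCE = 'function getAd(x){ return x; } getAd("abc");'
args = {"url": "http://domain.com", "script": 1, "body": JS_SOURCE}

# What the caller wants on the wire: the JavaScript source itself.
intended_body = JS_SOURCE

# What the quoted line produces: a JSON document that merely contains the script.
actual_body = json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)

print(actual_body)
print(actual_body == intended_body)
```

Whatever is placed in `args`, the serialized body is always a JSON object, so a raw `application/javascript` payload cannot survive this step.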
In fact, a Splash Request structured as such:
splash_request = scrapy.Request(
    my_interesting_request_url,
    callback=self.parse,
    errback=self.err_back.errback_httpbin,
    meta={
        "request_item": request_item,
        "splash": {
            "args": {
                "method": "POST",
                "body": JS_SOURCE,
                "url": request_item["request_url"],
                "html": 1,
                "script": 1,
                "max-timeout": Config.SPLASH_MAX_TIMEOUT,
                "slots": Config.SPLASH_NUM_OF_SLOTS,
            },
            "splash_headers": {"Content-Type": "application/javascript"},
            "endpoint": Config.SPLASH_RENDER_JSON_ENDPOINT + "?html=1&script=1&url=" + request_item["request_url"]
        }
    },
)
will give me a body like this:
b'{\n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",\n "Accept-Encoding": "gzip,deflate",\n "Accept-Language": "en",\n "User-Agent": "Scrapy/1.4.0 (+http://scrapy.org)"\n },\n "html": 1,\n "js_source": "JSON.stringify(WSI.assortmentJson)",\n "max-timeout": 3600,\n "method": "GET",\n "proxy": "http://ec2-54-234-146-220.compute-1.amazonaws.com:8080",\n "script": 1,\n "slots": 5,\n "url": "https://www.potterybarn.com/products/cambria-stoneware-mug-stone/"\n}'
That is what I see when I log the request body in MySpider.parse (retrieved from the response object). You can even see the headers being set into the args object here: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py#L322
This is obviously wrong. I haven’t the faintest idea of why headers would be put into the POST body. If there is a “right way” to do it, it’s not clear from the documentation - in fact the documentation seems to be misleading. I’ve also tried with SplashRequest, but this is just another layer of abstraction that doesn’t treat the actual problem of the request body being screwed up.
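On the headers specifically: as far as I can tell from the scrapy-splash README, forwarding the Scrapy request headers to Splash in a `headers` JSON field is the default, and setting `dont_send_headers` to True in `meta["splash"]` turns it off. The sketch below is a simplified approximation of that behaviour, not the actual middleware code; `build_splash_body` is a hypothetical helper written only for this illustration.

```python
import json


def build_splash_body(args, request_headers, splash_options):
    """Hypothetical, simplified stand-in for the middleware's body-building step."""
    merged = dict(args)
    # Headers are copied into the JSON arguments unless the caller opts out.
    if not splash_options.get("dont_send_headers") and request_headers:
        merged.setdefault("headers", dict(request_headers))
    return json.dumps(merged, ensure_ascii=False, sort_keys=True, indent=4)


# Illustrative inputs only.
headers = {"Accept-Language": "en", "User-Agent": "Scrapy/1.4.0 (+http://scrapy.org)"}
args = {"url": "http://domain.com", "html": 1, "script": 1}

with_headers = json.loads(build_splash_body(args, headers, {}))
without_headers = json.loads(
    build_splash_body(args, headers, {"dont_send_headers": True})
)

print("headers" in with_headers)
print("headers" in without_headers)
```

Opting out only removes the `headers` key. The body is still a JSON dump of the args, so it does not address the raw-JavaScript case.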
I’ll try and submit a pull request for this in the next few weeks, but honestly I’m surprised this hasn’t been complained about elsewhere.
Issue Analytics
- Created 5 years ago
- Comments: 14 (5 by maintainers)
Top GitHub Comments
I believe we tried that to no avail, but I’ll go ahead and give it a shot this week (probably tomorrow). For the time being I’m just wrapping everything with Lua, which is OK but a little overkill for the need here.
Thanks for your help! I really appreciate the hard work that goes into contributing to open source (I’m trying to figure out how to do more of it myself), so no worries.
Thanks for your patience, @Datamance. Now I see; I didn’t know about this API feature. Indeed, SplashMiddleware can’t send such a request, but I think the same can be achieved by roughly the following