SplashMiddleware breaks "script": 1 invocations

See: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py#L324

This line: body = json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)

Breaks any SplashRequest that is trying to emulate the following:

# Render page and execute simple Javascript function, display the js output
curl -X POST -H 'content-type: application/javascript' \
    -d 'function getAd(x){ return x; } getAd("abc");' \
    'http://localhost:8050/render.json?url=http://domain.com&script=1'
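For comparison, here is a minimal Python sketch of the same raw request, built with the standard library only and never actually sent. The helper name `build_script_request` and the `SPLASH_URL` constant are illustrative assumptions, not part of scrapy-splash or Splash; the point is only that the POST body is the JavaScript source itself, while `url` and `script` travel in the query string.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Splash is listening on the same host/port as in the curl example.
SPLASH_URL = "http://localhost:8050/render.json"
JS_SOURCE = 'function getAd(x){ return x; } getAd("abc");'


def build_script_request(target_url, js_source):
    """Build (but do not send) the POST request that the curl command issues.

    Hypothetical helper for illustration only.
    """
    # Splash arguments go in the query string, as in the curl example.
    query = urlencode({"url": target_url, "script": 1})
    return Request(
        f"{SPLASH_URL}?{query}",
        data=js_source.encode("utf-8"),  # the body is the raw JavaScript
        headers={"Content-Type": "application/javascript"},
        method="POST",
    )


req = build_script_request("http://domain.com", JS_SOURCE)
print(req.get_method())
print(req.full_url)
print(req.data)
```

This is the request shape that the issue says cannot be expressed through SplashMiddleware, because the middleware supplies its own body.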

If you just steamroll the intended POST body with a JSON dump of the args, then it’s basically impossible to structure a render.json request that sends raw JavaScript as its body (at least not without resorting to Lua, from what I can see).
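The effect described above can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual middleware code: `build_splash_body` is a hypothetical function that applies the same `json.dumps` call quoted in the issue and, like the reported behaviour, ignores whatever body the caller intended to send.

```python
import json


def build_splash_body(original_body, args):
    """Mimic the reported behaviour (hypothetical helper, not scrapy-splash code).

    The caller's body is discarded and replaced by the JSON-encoded args.
    """
    # original_body is intentionally unused: that is the problem being reported.
    return json.dumps(args, ensure_ascii=False, sort_keys=True, indent=4)


js_source = 'function getAd(x){ return x; } getAd("abc");'
args = {"url": "http://domain.com", "script": 1, "html": 1}

body = build_splash_body(js_source, args)
print(body)
```

Unless the JavaScript is also placed somewhere inside `args`, it never reaches Splash, which matches the request body logged further down.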

In fact, a Splash Request structured as such:

splash_request = scrapy.Request(
    my_interesting_request_url,
    callback=self.parse,
    errback=self.err_back.errback_httpbin,
    meta={
        "request_item": request_item,
        "splash": {
            "args": {
                "method": "POST",
                "body": JS_SOURCE,
                "url": request_item["request_url"],
                "html": 1,
                "script": 1,
                "max-timeout": Config.SPLASH_MAX_TIMEOUT,
                "slots": Config.SPLASH_NUM_OF_SLOTS,
            },
            "splash_headers": {"Content-Type": "application/javascript"},
            "endpoint": Config.SPLASH_RENDER_JSON_ENDPOINT + "?html=1&script=1&url=" + request_item["request_url"]
        }
    },
)

Will give me a body like this:

b'{\n    "headers": {\n        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",\n        "Accept-Encoding": "gzip,deflate",\n
      "Accept-Language": "en",\n        "User-Agent": "Scrapy/1.4.0 (+http://scrapy.org)"\n    },\n    "html": 1,\n    "js_source": "JSON.stringify(WSI.assortmentJson)",\n
    "max-timeout": 3600,\n    "method": "GET",\n    "proxy": "http://ec2-54-234-146-220.compute-1.amazonaws.com:8080",\n    "script": 1,\n    "slots": 5,\n    "url": "https://www.potterybarn.com/products/cambria-stoneware-mug-stone/"\n}'

That is what I see when I log the request body in MySpider.parse (retrieved from the response object). You can even see the headers being set into the args object here: https://github.com/scrapy-plugins/scrapy-splash/blob/master/scrapy_splash/middleware.py#L322

This is obviously wrong. I haven’t the faintest idea of why headers would be put into the POST body. If there is a “right way” to do it, it’s not clear from the documentation - in fact the documentation seems to be misleading. I’ve also tried with SplashRequest, but this is just another layer of abstraction that doesn’t treat the actual problem of the request body being screwed up.

I’ll try and submit a pull request for this in the next few weeks, but honestly I’m surprised this hasn’t been complained about elsewhere.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

Datamance commented, Aug 6, 2018

I believe we tried that to no avail, but I’ll go ahead and give it a shot this week (probably tomorrow). For the time being I’m just wrapping everything in Lua, which is OK but a little overkill for the need here.

Thanks for your help! I really appreciate the hard work that goes into contributing to open source (I’m trying to figure out how to do more of it myself), so no worries.

lopuhin commented, Aug 6, 2018

Thanks for your patience @Datamance, now I see; I didn’t know about this API feature. Indeed, SplashMiddleware can’t send such a request, but I think the same can be achieved with roughly the following:

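from scrapy_splash import SplashRequest

# Pass the JavaScript as the js_source argument instead of as the raw POST body.
SplashRequest(
    'http://domain.com',
    args={'js_source': 'function getAd(x){ return x; } getAd("abc");', 'script': 1},
    endpoint='render.json')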
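As a rough illustration of what this workaround amounts to on the wire, the sketch below builds (without sending) a JSON POST to render.json that carries the script in `js_source`. It assumes Splash accepts its arguments as a JSON body, which is what the middleware's `json.dumps` call implies; the helper name `build_js_source_request` and the default `splash_base` address are hypothetical.

```python
import json
from urllib.request import Request


def build_js_source_request(target_url, js_source, splash_base="http://localhost:8050"):
    """Build (but do not send) a JSON POST carrying the script as js_source.

    Hypothetical helper for illustration; splash_base is an assumed address.
    """
    args = {"url": target_url, "js_source": js_source, "script": 1}
    return Request(
        f"{splash_base}/render.json",
        data=json.dumps(args, sort_keys=True).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_js_source_request(
    "http://domain.com",
    'function getAd(x){ return x; } getAd("abc");',
)
payload = json.loads(req.data)
print(payload["js_source"])
```

Because the script is now just another argument, it survives the middleware's JSON serialisation instead of being overwritten by it.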
