Scrapy "session" extension
I’m interested in modifying Scrapy spider behavior slightly, to add some custom functionality and to avoid messing around with the `meta` dictionary so much. Basically, the implementation I’m thinking of will be an abstract subclass of `scrapy.Spider`, which I will call `SessionSpider`. The primary differences will be:
- Instead of the normal spider parse callback signature `(self, response)`, `SessionSpider` will have `(self, session, response)` callbacks. The `session` argument will be some kind of `Session` object that at least keeps track of cookies (and possibly proxies and certain headers).
- This will require a change in how the cookie middleware works. Instead of passing a cookiejar ID, the session will keep track of cookies directly. As a side note: does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders, because I want them to run “forever” on an unbounded list of URLs.
- A `SessionSpider` callback that wants to create requests with the same session will generate requests using a `session.Request` factory method that returns a `scrapy.Request`. This method will take care of merging session variables with the new request.
- I’m hoping to implement most of the features I want by having the `Session` object do the `meta` manipulation behind the scenes, so that `SessionSpider` subclasses don’t have to touch `meta` as much. However, I will also have to modify/add middleware, since I want to change how cookiejars are passed around. A rough sketch of the API I’m imagining follows this list.
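To make this concrete, here is a rough sketch of what I’m imagining. Every name in it (`Session`, `SessionSpider`, `parse_session`, the `session` meta key) is hypothetical, and the cookie/proxy merging shown is just a placeholder for whatever the modified middleware ends up doing:

```python
import scrapy


class Session:
    """Hypothetical session object: owns cookies, and maybe a proxy and headers."""

    def __init__(self, cookies=None, proxy=None, headers=None):
        self.cookies = dict(cookies or {})
        self.proxy = proxy
        self.headers = dict(headers or {})

    def Request(self, url, **kwargs):
        """Factory that merges session state into a plain scrapy.Request."""
        meta = dict(kwargs.pop("meta", {}))
        meta["session"] = self  # a modified middleware would read this back
        if self.proxy:
            meta.setdefault("proxy", self.proxy)
        headers = {**self.headers, **kwargs.pop("headers", {})}
        return scrapy.Request(url, meta=meta, headers=headers,
                              cookies=self.cookies, **kwargs)


class SessionSpider(scrapy.Spider):
    """Hypothetical abstract base class for session-aware spiders."""

    def parse(self, response):
        # Scrapy always invokes callbacks as (self, response); this shim
        # recovers the session from meta and forwards it, producing the
        # (self, session, response) signature described above.
        return self.parse_session(response.meta["session"], response)

    def parse_session(self, session, response):
        raise NotImplementedError
```

A subclass would then implement `parse_session(self, session, response)` and create follow-up requests with `session.Request(url)`.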
I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Any issues I might run into? I see that this kind of thing has been discussed before: #1878
Top GitHub Comments
Thoughts: https://github.com/ThomasAitken/scrapy-sessions ?
@ThomasAitken About one of the claims in the https://github.com/ThomasAitken/scrapy-sessions readme: it is not true. Basically, `CookiesMiddleware` is a wrapper around a dictionary of `CookieJar` objects from Python’s builtin `http` module.

It is possible to reach the `CookiesMiddleware` object, with its contents, directly from the spider’s `start_requests` and `parse` methods (via the `crawler` object, as with most other middlewares/modules). Unfortunately, the `Crawler` object doesn’t have a method to get a middleware object by its name, so it takes this… trick:
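A minimal sketch of the trick (note that `crawler.engine.downloader.middleware.middlewares` and the middleware’s `jars` dict are internal, undocumented Scrapy APIs, so this may break between versions):

```python
import scrapy
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


class JarPeekSpider(scrapy.Spider):
    name = "jar_peek"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Crawler has no lookup-by-name API, so scan the downloader
        # middleware instances for the CookiesMiddleware.
        mw = next(
            m
            for m in self.crawler.engine.downloader.middleware.middlewares
            if isinstance(m, CookiesMiddleware)
        )
        # The middleware keeps its jars in a defaultdict keyed by the
        # request's 'cookiejar' meta value (None for the default jar).
        jar = mw.jars[response.meta.get("cookiejar")]
        # Scrapy's CookieJar wraps the stdlib jar in its .jar attribute.
        self.logger.info("session cookies: %s", list(jar.jar))
```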
With direct access to the `CookieJar` objects from the `CookiesMiddleware` object, and the `cookiejar` ID from `response.meta`, you are already able to make any manipulations with sessions that you need.
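For example, dropping a jar once its session is finished (which would also address the “jars stay around forever” concern from the original post) is just a dict operation on the middleware’s internal state, reusing the `mw` handle from the snippet above:

```python
# Sketch: evict this session's jar so it can be garbage-collected.
# mw.jars is a plain defaultdict, so pop() is enough.
mw.jars.pop(response.meta.get("cookiejar"), None)
```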
Unfortunately, the readme’s other claim is true. By default, Scrapy uses a single `CookieJar` object for all requests, and a single user agent from the settings, plus a lot of additional issues arise when multiple proxies are used. Most of the publicly available proxy rotation modules for Scrapy don’t create a `CookieJar` per proxy, so they are not session safe. Some proxy providers already include session handling as a service in addition to scraping proxies; in that case, only proxy handling is required from the Scrapy user. For the rest of the cases, I agree that the idea is relevant.
In order to bind a proxy address to a cookiejar, it is enough to use the same value for the `proxy` and `cookiejar` request meta keys (no extra middlewares required), as I did in this gist code sample.
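A minimal sketch of the idea, with hypothetical proxy addresses (each proxy URL doubles as the cookiejar key, so every proxy gets its own `CookieJar`):

```python
import scrapy


class ProxyBoundSpider(scrapy.Spider):
    name = "proxy_bound"
    # Hypothetical proxy endpoints; each one becomes its own session.
    proxies = ["http://127.0.0.1:8031", "http://127.0.0.1:8032"]

    def start_requests(self):
        for proxy in self.proxies:
            # The same value for both meta keys binds the proxy to a jar;
            # no extra middleware is needed.
            yield scrapy.Request(
                "https://example.com",
                meta={"proxy": proxy, "cookiejar": proxy},
                dont_filter=True,
            )

    def parse(self, response):
        # Follow-up requests keep the binding by copying both keys.
        yield scrapy.Request(
            "https://example.com/next",
            meta={
                "proxy": response.meta["proxy"],
                "cookiejar": response.meta["cookiejar"],
            },
        )
```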