Scrapy "session" extension
I’m interested in modifying Scrapy spider behavior slightly to add some custom functionality and avoid messing around with the `meta` dictionary so much. Basically, the implementation I’m thinking of will be an abstract subclass of `scrapy.Spider` which I will call `SessionSpider`. The primary differences will be:
- Instead of the normal spider parse callback signature `(self, response)`, `SessionSpider` will have `(self, session, response)` callbacks. The `session` argument will be some kind of `Session` object that at least keeps track of cookies (and possibly proxies and certain headers).
- This will require a change in how the cookie middleware works. Instead of passing a cookiejar ID, the session will keep track of cookies directly. As a side note: does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders because I want them to run “forever” on an unbounded list of URLs.
- A `SessionSpider` callback that wants to create requests with the same session will generate requests using a `session.Request` factory method that returns a `scrapy.Request`. This method will take care of merging session variables with the new request (see the sketch after this list).
- I’m hoping to implement most of the features I want by having the `Session` object do the `meta` manipulation behind the scenes so that `SessionSpider` subclasses don’t have to touch `meta` as much. However, I will also have to modify/add middleware, since I want to change how cookiejars are passed around.
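
To make this concrete, here is a rough sketch of the API shape I have in mind — all of these names (`Session`, `SessionSpider`, `parse_session`) are hypothetical, and a real implementation would still need the middleware changes described above:

```python
import scrapy

class Session:
    """Hypothetical session object: wraps a cookiejar ID (and optionally
    a proxy) and does the meta bookkeeping behind the scenes."""

    def __init__(self, session_id, proxy=None):
        self.session_id = session_id
        self.proxy = proxy

    def Request(self, url, callback=None, **kwargs):
        # Factory method: merge the session's state into the request meta
        # so SessionSpider subclasses never touch meta directly.
        meta = dict(kwargs.pop("meta", None) or {})
        meta["cookiejar"] = self.session_id
        if self.proxy:
            meta["proxy"] = self.proxy
        return scrapy.Request(url, callback=callback, meta=meta, **kwargs)

class SessionSpider(scrapy.Spider):
    """Abstract base class: rebuilds the Session from response meta and
    hands callbacks a (session, response) pair."""

    def parse(self, response):
        session = Session(
            response.meta.get("cookiejar"),
            proxy=response.meta.get("proxy"),
        )
        return self.parse_session(session, response)

    def parse_session(self, session, response):
        raise NotImplementedError
```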
I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Any issues I might run into? I see that this kind of thing has been discussed before: #1878

Thoughts: https://github.com/ThomasAitken/scrapy-sessions ?
@ThomasAitken From the https://github.com/ThomasAitken/scrapy-sessions readme:

It is not true. Basically, `CookiesMiddleware` is a wrapper around a dictionary of `CookieJar` objects from Python’s builtin `http` module. It is possible to reach the `CookiesMiddleware` object, with its contents, directly from the spider’s `start_requests` and `parse` methods via the `crawler` object (as with most other middlewares/modules). Unfortunately, the `Crawler` object doesn’t have a method to get a middleware object by its name, so it is possible with this… trick:
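
A minimal sketch of such a trick, assuming the default `CookiesMiddleware` is enabled; note that `crawler.engine.downloader.middleware.middlewares` is an internal attribute of the middleware manager, not a stable public API:

```python
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

def get_cookies_middleware(crawler):
    # Crawler offers no lookup-by-name, so scan the downloader
    # middleware manager's internal list of middleware instances.
    for mw in crawler.engine.downloader.middleware.middlewares:
        if isinstance(mw, CookiesMiddleware):
            return mw
    raise RuntimeError("CookiesMiddleware is not enabled")

# Inside a spider callback:
#   mw = get_cookies_middleware(self.crawler)
#   jar = mw.jars[response.meta.get("cookiejar")]  # that session's CookieJar
```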
With direct access to the `CookieJar`s from the `CookiesMiddleware` object and the `cookiejar` ID from `response.meta`, you are already able to make any manipulations with sessions.

Unfortunately this is true. By default Scrapy uses a single `CookieJar` object for all requests and a single user agent from settings, and a lot of additional issues arise when multiple proxies are used. Most of the publicly available proxy rotation modules for Scrapy don’t create a `CookieJar` per proxy, so they are not session safe. Some proxy providers already include session handling as a service in addition to scraping proxies; in that case, only the proxy handling is required from the Scrapy user.
For the rest of the cases, I agree that the idea is relevant.
In order to bind a proxy address to a cookiejar, it is enough to use the same key value for the `proxy` and `cookiejar` request meta keys (no extra middlewares required), as I did in this gist code sample.
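
A minimal sketch of that same-key approach, assuming the stock `HttpProxyMiddleware` and `CookiesMiddleware` are enabled (the proxy URLs and target site are placeholders):

```python
import scrapy

class ProxySessionSpider(scrapy.Spider):
    name = "proxy_sessions"

    # Hypothetical proxy pool; substitute real endpoints.
    proxies = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def start_requests(self):
        for i, proxy in enumerate(self.proxies):
            # The same value keys both meta entries, so each proxy gets
            # its own cookiejar and cookies never leak across proxies.
            yield scrapy.Request(
                "https://httpbin.org/cookies/set/session/abc",
                meta={"proxy": proxy, "cookiejar": i},
                callback=self.parse,
            )

    def parse(self, response):
        # Follow-up requests must carry the same pair of meta keys to
        # stay in the same proxy-bound session.
        yield response.follow(
            "https://httpbin.org/cookies",
            meta={
                "proxy": response.meta["proxy"],
                "cookiejar": response.meta["cookiejar"],
            },
            callback=self.check_cookies,
        )

    def check_cookies(self, response):
        self.logger.info("session cookies: %s", response.text)
```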