Scrapy "session" extension

See original GitHub issue

I’m interested in modifying Scrapy spider behavior slightly to add some custom functionality and avoid messing around with the meta dictionary so much. Basically, the implementation I’m thinking of will be an abstract subclass of scrapy.Spider which I will call SessionSpider. The primary differences will be:

Instead of the normal spider parse callback signature (self, response), SessionSpider will have (self, session, response) callbacks. The session argument will be some kind of Session object that at least keeps track of cookies (and possibly proxies and certain headers).
This will require a change in how the cookie middleware works. Instead of passing a cookie jar ID, the session will keep track of cookies directly. As a side note: does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders because I want them to run “forever” on an unbounded list of URLs.
A SessionSpider callback that wants to create requests with the same session will generate requests using a session.Request factory method that returns a scrapy.Request. This method will take care of merging session variables with the new request.
I’m hoping to implement most of the features I want by having the Session object do the meta manipulation behind the scenes so that SessionSpider subclasses don’t have to touch meta as much. However, I will also have to modify/add middleware, since I want to change how cookiejars are passed around.

I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Any issues I might run into? I see that this kind of thing has been discussed before: #1878
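
A minimal sketch of the Session idea described above, under stated assumptions: the Session class, its request_kwargs method, and the jar_id counter are hypothetical names and not part of Scrapy. Only the cookiejar and proxy meta keys and the cb_kwargs argument are existing Scrapy features. The sketch builds keyword arguments for scrapy.Request rather than the request itself, so it runs without Scrapy installed.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Dict, Optional

_session_ids = count(1)  # hypothetical source of unique cookiejar keys


@dataclass
class Session:
    """Hypothetical helper that hides meta manipulation; not part of Scrapy."""

    proxy: Optional[str] = None
    headers: Dict[str, str] = field(default_factory=dict)
    jar_id: int = field(default_factory=lambda: next(_session_ids))

    def request_kwargs(self, url: str, **kwargs: Any) -> Dict[str, Any]:
        """Return keyword arguments for scrapy.Request with session state merged in."""
        meta = dict(kwargs.pop("meta", {}))
        meta["cookiejar"] = self.jar_id  # meta key read by Scrapy's cookies middleware
        if self.proxy:
            meta["proxy"] = self.proxy  # meta key read by Scrapy's proxy middleware
        # Per-request headers override the session defaults.
        headers = {**self.headers, **kwargs.pop("headers", {})}
        # Hand the session to the callback as a keyword argument.
        cb_kwargs = {**kwargs.pop("cb_kwargs", {}), "session": self}
        return {"url": url, "meta": meta, "headers": headers,
                "cb_kwargs": cb_kwargs, **kwargs}
```

Inside a spider this could be used as yield scrapy.Request(callback=self.parse_page, **session.request_kwargs(url)). Because cb_kwargs are passed by keyword, the callback would be written as parse_page(self, response, session) rather than the (self, session, response) order proposed above.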

Issue Analytics

State:
Created 5 years ago
Reactions: 5
Comments: 15 (8 by maintainers)

Top GitHub Comments

2 reactions

ThomasAitken commented, Apr 30, 2021

Thoughts: https://github.com/ThomasAitken/scrapy-sessions ?

1 reaction

GeorgeA92 commented, May 1, 2021

@ThomasAitken From https://github.com/ThomasAitken/scrapy-sessions readme:

Scrapy’s sessions are a black box

That is not true. CookiesMiddleware is basically a wrapper around a dictionary of CookieJar objects from Python’s built-in http module.

…They can’t be exposed within a scrape and they can’t be directly altered. 2. Scrapy makes it very difficult to easily replace a session (and/or general ‘profile’) unilaterally across all requests that are scheduled or enqueued. This is important for engaging with websites that have session-expiry logic.

It is possible to reach the CookiesMiddleware object and its contents directly from the spider’s start_requests and parse methods (through the crawler object, as with most other middlewares/modules):

Unfortunately, the Crawler object doesn’t have a method to get a middleware object by its name, so it is possible with this… trick:

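import scrapy
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

class Myspider(scrapy.Spider):
    def start_requests(self):
        # Match on the class itself; the substring "CookieMiddleware" never occurs in "CookiesMiddleware"
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.cookies_middleware = next(mw for mw in downloader_middlewares if isinstance(mw, CookiesMiddleware))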

With direct access to the CookieJars from the CookiesMiddleware object and the cookiejar ID from response.meta, you are already able to manipulate sessions however you like.
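
The "dictionary of CookieJar objects" model can be illustrated with the standard library alone. This is a sketch, not Scrapy code: jars stands in for the middleware's internal mapping (in Scrapy this appears to be the jars attribute of the middleware, which is worth verifying against the installed version), and make_cookie, cookie_names and drop_session are hypothetical helpers. It also shows one way to discard a jar explicitly, which matters for crawls that run "forever".

```python
from collections import defaultdict
from http.cookiejar import Cookie, CookieJar

# Stand-in for the middleware's mapping of cookiejar ID -> CookieJar.
jars = defaultdict(CookieJar)


def make_cookie(name: str, value: str, domain: str) -> Cookie:
    """Build a minimal session cookie for the given domain."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_names(jar_id) -> list:
    """List the cookie names held by one jar, i.e. one session."""
    return sorted(cookie.name for cookie in jars[jar_id])


def drop_session(jar_id) -> None:
    """Forget a jar entirely so a long-running crawl does not accumulate jars."""
    jars.pop(jar_id, None)


# Two independent sessions against the same site.
jars["profile-a"].set_cookie(make_cookie("sessionid", "abc123", "example.com"))
jars["profile-b"].set_cookie(make_cookie("sessionid", "xyz789", "example.com"))

jars["profile-a"].clear()  # reset one session without touching the other
drop_session("profile-b")  # remove the other session's jar altogether
```

In a spider, the key into this mapping would be the value stored under the cookiejar key of response.meta.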

Scrapy provides no native capability for maintaining distinct profiles (client identities) within a single scrape.

Unfortunately, this is true. By default, Scrapy uses a single CookieJar object for all requests and a single user agent from settings, and there are many additional issues when multiple proxies are used. Most publicly available proxy rotation modules for Scrapy don’t create a CookieJar per proxy, so they are not session safe.

The idea of this tool is to manage distinct client identities within a scrape. The identity consists of two or more of the following attributes: session + user agent + proxy.

Some proxy providers already include session handling as a service in addition to scraping proxies. In that case, of the attributes in that list, only proxy handling is required from the Scrapy user.

For the rest of the cases, I agree that the idea is relevant.

from w3lib.http import basic_auth_header
PROFILES = [
    {"proxy":['proxy_url', basic_auth_header('username', 'password')], "user-agent": "MY USER AGENT"},
    {"proxy":['proxy_url', basic_auth_header('username', 'password')], "user-agent": "MY USER AGENT"}
]

To bind a proxy address to a cookiejar, it is enough to use the same value for the proxy and cookiejar request meta keys (no extra middlewares required), as I did in this gist code sample.
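
The gist itself is not reproduced here, so the following is only a sketch of the binding described above. The proxy URLs, request_kwargs_for and rotate are hypothetical, and proxy authentication is left out for brevity. The single point it illustrates is that the same value goes under both the proxy and cookiejar meta keys, so each proxy gets its own cookie jar.

```python
from itertools import cycle
from typing import Any, Dict, Iterable, Iterator

# Hypothetical profiles; authentication omitted to keep the sketch short.
PROFILES = [
    {"proxy": "http://proxy-1.example:8080", "user-agent": "MY USER AGENT 1"},
    {"proxy": "http://proxy-2.example:8080", "user-agent": "MY USER AGENT 2"},
]


def request_kwargs_for(profile: Dict[str, str], url: str) -> Dict[str, Any]:
    """Build keyword arguments for scrapy.Request bound to one profile."""
    return {
        "url": url,
        "headers": {"User-Agent": profile["user-agent"]},
        # Same value under both keys: one cookie jar per proxy.
        "meta": {"proxy": profile["proxy"], "cookiejar": profile["proxy"]},
    }


def rotate(urls: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Assign profiles to URLs round-robin."""
    for url, profile in zip(urls, cycle(PROFILES)):
        yield request_kwargs_for(profile, url)
```

In start_requests, each resulting dictionary could be expanded into a request with scrapy.Request(**kwargs). Follow-up requests would then copy the proxy and cookiejar keys from response.meta to stay on the same identity.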
