Scrapy "session" extension

See original GitHub issue

I’m interested in modifying Scrapy spider behavior slightly to add some custom functionality and avoid messing around with the meta dictionary so much. Basically, the implementation I’m thinking of will be an abstract subclass of scrapy.Spider which I will call SessionSpider. The primary differences will be:

  • Instead of the normal spider parse callback signature (self, response), SessionSpider will have (self, session, response) callbacks. The session argument will be some kind of Session object that at least keeps track of cookies (and possibly proxies and certain headers).

  • This will require a change in how the cookie middleware works. Instead of passing a cookie jar ID, the session will keep track of cookies directly. As a side note: does the default cookie middleware ever drop cookiejars? I could be missing something, but it looks to me like they stay around forever. This would be a problem for my spiders because I want them to run “forever” on an unbounded list of URLs.

  • A SessionSpider callback that wants to create requests with the same session will generate requests using a session.Request factory method that returns a scrapy.Request. This method will take care of merging session variables with the new request.

  • I’m hoping to implement most of the features I want by having the Session object do the meta manipulation behind the scenes so that SessionSpider subclasses don’t have to touch meta as much. However, I will also have to modify/add middleware, since I want to change how cookiejars are passed around.
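A minimal sketch of the Session idea described in the list above, under simplifying assumptions: the `Session` class, its `request_kwargs` method, and the `"session"` meta key are hypothetical names and not part of Scrapy. Instead of holding cookies itself, this version assigns each session a unique cookiejar ID and relies on the stock `cookiejar` and `proxy` meta keys, so no middleware changes are needed.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Dict, Optional

_jar_ids = count(1)  # hypothetical source of unique cookiejar IDs


@dataclass
class Session:
    """Hypothetical session object; not part of Scrapy."""

    proxy: Optional[str] = None
    headers: Dict[str, str] = field(default_factory=dict)
    jar_id: int = field(default_factory=lambda: next(_jar_ids))

    def request_kwargs(self, url: str, **kwargs: Any) -> Dict[str, Any]:
        """Merge session state into keyword arguments for scrapy.Request."""
        meta = dict(kwargs.pop("meta", None) or {})
        meta["session"] = self           # hypothetical key so a callback can recover the session
        meta["cookiejar"] = self.jar_id  # meta key read by Scrapy's CookiesMiddleware
        if self.proxy is not None:
            meta["proxy"] = self.proxy   # meta key read by Scrapy's HttpProxyMiddleware
        # Per-request headers override the session defaults.
        headers = {**self.headers, **(kwargs.pop("headers", None) or {})}
        return {"url": url, "meta": meta, "headers": headers, **kwargs}
```

Inside a spider callback this could be used as `yield scrapy.Request(**session.request_kwargs(url, callback=self.parse_item))`, and the callback could read the session back from `response.meta["session"]`. Passing `session` as a separate callback argument, as proposed above, would need an extra wrapper that this sketch does not include.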

I thought I would post this here just to see what thoughts people have. Is this a bad idea? Has it been tried before? Are there any issues I might run into? I see that this kind of thing has been discussed before: #1878

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 5
  • Comments: 15 (8 by maintainers)

Top GitHub Comments

2 reactions
ThomasAitken commented, Apr 30, 2021
1 reaction
GeorgeA92 commented, May 1, 2021

@ThomasAitken From https://github.com/ThomasAitken/scrapy-sessions readme:

Scrapy’s sessions are a black box

That is not true. CookiesMiddleware is basically a wrapper around a dictionary of CookieJar objects from Python's built-in http module.

…They can’t be exposed within a scrape and they can’t be directly altered. 2. Scrapy makes it very difficult to easily replace a session (and/or general ‘profile’) unilaterally across all requests that are scheduled or enqueued. This is important for engaging with websites that have session-expiry logic.

The CookiesMiddleware object and its contents can be reached directly from the spider's start_requests and parse methods through the crawler object, as can most other middlewares and modules.

Unfortunately, the Crawler object doesn't have a method to get a middleware object by its name, so it is possible with this… trick:

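import scrapy
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware

class Myspider(scrapy.Spider):
    def start_requests(self):
        # The class is named CookiesMiddleware, so match on the type rather than on a substring of its name
        downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
        self.cookies_middleware = [mw for mw in downloader_middlewares if isinstance(mw, CookiesMiddleware)][0]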

With direct access to the CookieJars from the CookiesMiddleware object and the cookiejar ID from response.meta, you are already able to manipulate sessions in any way you need.
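As a rough illustration of such manipulations, here are two helpers that work on the middleware's jar dictionary. The function names are hypothetical, and the sketch assumes the middleware instance exposes its jars as a `jars` dictionary keyed by the `cookiejar` meta value, with jar objects that can be iterated to yield cookies.

```python
from typing import Any, Dict, MutableMapping, Optional


def session_cookies(jars: MutableMapping[Any, Any], jar_id: Any) -> Dict[str, Optional[str]]:
    """Return {name: value} for one session's cookies.

    Cookies that share a name across domains or paths collapse to one entry.
    """
    jar = jars.get(jar_id)  # .get() avoids auto-creating a jar in a defaultdict
    return {} if jar is None else {cookie.name: cookie.value for cookie in jar}


def drop_session(jars: MutableMapping[Any, Any], jar_id: Any) -> Any:
    """Forget a session's jar so it no longer stays around; returns the jar or None."""
    return jars.pop(jar_id, None)
```

In a callback these could be called as `session_cookies(self.cookies_middleware.jars, response.meta.get("cookiejar"))`, assuming the middleware instance was stored on the spider beforehand. Calling `drop_session` once a session is finished is one way to keep the dictionary from growing on a long-running crawl.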

  1. Scrapy provides no native capability for maintaining distinct profiles (client identities) within a single scrape.

Unfortunately, this is true. By default, Scrapy uses a single CookieJar object for all requests and a single user agent from the settings, and further issues arise when multiple proxies are used. Most publicly available proxy rotation modules for Scrapy do not create a CookieJar per proxy, so they are not session-safe.

The idea of this tool is to manage distinct client identities within a scrape. The identity consists of two or more of the following attributes: session + user agent + proxy.

Some proxy providers already include session handling as a service in addition to scraping proxies. In that case, of the attributes in that list, the Scrapy user only needs to handle the proxy.

For the remaining cases, I agree that the idea is relevant.

from w3lib.http import basic_auth_header
PROFILES = [
    {"proxy":['proxy_url', basic_auth_header('username', 'password')], "user-agent": "MY USER AGENT"},
    {"proxy":['proxy_url', basic_auth_header('username', 'password')], "user-agent": "MY USER AGENT"}
]

To bind a proxy address to a cookiejar, it is enough to use the same value for the proxy and cookiejar request meta keys (no extra middleware required), as I did in this gist code sample.
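The gist itself is not reproduced here, so the following is only a sketch of the same-value approach under stated assumptions: the profile entries are placeholders, the function names are hypothetical, and proxy authentication is left out.

```python
from itertools import cycle
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical profiles; the proxy URLs and user agents are placeholders.
PROFILES: List[Dict[str, str]] = [
    {"proxy": "http://proxy-a.example:8080", "user-agent": "MY USER AGENT A"},
    {"proxy": "http://proxy-b.example:8080", "user-agent": "MY USER AGENT B"},
]


def profile_request_kwargs(url: str, profile: Dict[str, str]) -> Dict[str, Any]:
    """Build scrapy.Request keyword arguments that tie one cookiejar to one proxy."""
    proxy = profile["proxy"]
    return {
        "url": url,
        "headers": {"User-Agent": profile["user-agent"]},
        # The same value for both meta keys gives every proxy its own cookiejar.
        "meta": {"proxy": proxy, "cookiejar": proxy},
    }


def assign_profiles(urls: Iterable[str],
                    profiles: List[Dict[str, str]] = PROFILES) -> Iterator[Dict[str, Any]]:
    """Spread URLs over the profiles in round-robin order."""
    for url, profile in zip(urls, cycle(profiles)):
        yield profile_request_kwargs(url, profile)
```

In start_requests, each result could be passed on as `scrapy.Request(**kwargs, callback=self.parse)`. Follow-up requests would need to copy the `proxy` and `cookiejar` meta values from the response so they stay on the same identity.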
