Grabbing parameters from default requests
See original GitHub issue
scrapy_playwright tries to imitate the web browser, so it downloads all resources (images, scripts, stylesheets, etc.). Given that this information is downloaded, is it possible to grab the payload from specific requests shown in the network tab, specifically the Fetch/XHR tab?
For example (minimal reproducible code):
import scrapy
from scrapy_playwright.page import PageCoroutine

class testSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta={
                "playwright": True,
                "playwright_page_coroutines": [
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                    PageCoroutine("screenshot", path="scroll.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        pass
Produces the following output:
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=2> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=3> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=4> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
...
This is very similar to Firefox's network tab, which lets you copy the URL and its parameters. The aim is to store the URL parameters in a list each time Playwright downloads a specific request URL. On a more complex website I would have to find specific request URLs and grab their URL parameters.
Something like:
if response.meta['resource_type'] == 'xhr':
    print(parameters(response.meta['resource_type_urls']))
This is a pseudo-example to express what I want to get; parameters would be a function that grabs the URL parameters.
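For reference, a parameters helper like the hypothetical one above could be a thin wrapper around the standard library, assuming the payload is carried in the query string:

from urllib.parse import urlsplit, parse_qs

def parameters(url: str) -> dict:
    # hypothetical helper from the pseudo-example above:
    # decode the query string into a dict of lists
    return parse_qs(urlsplit(url).query)

print(parameters("http://quotes.toscrape.com/api/quotes?page=2"))
# {'page': ['2']}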
Or perhaps it works like this:
if response.meta['resource_type'] == 'xhr':
    print(response.meta['parameters'])
However, saving this into response.meta will likely bloat the results if I have a large number of URLs per resource type, since URL parameters are fairly large dicts.
- I’m convinced this data is available, since it is downloaded; I just do not know how to get it.
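A minimal sketch of one way to get it, assuming scrapy-playwright's playwright_page_event_handlers meta key (which can reference a spider coroutine by name) and collecting into a spider attribute rather than response.meta, to avoid the bloat mentioned above; the spider and attribute names are illustrative:

import scrapy
from urllib.parse import urlsplit, parse_qs

class XHRParamsSpider(scrapy.Spider):
    name = "xhr_params"  # hypothetical spider, for illustration only

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.xhr_params = []  # collected here, not in response.meta

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta={
                "playwright": True,
                "playwright_page_event_handlers": {
                    # fires for every request the page makes
                    "request": "handle_request",
                },
            },
        )

    async def handle_request(self, request) -> None:
        # `request` is a playwright.async_api.Request
        if request.resource_type == "xhr":
            self.xhr_params.append(parse_qs(urlsplit(request.url).query))

    def parse(self, response):
        self.logger.info("XHR parameters collected: %s", self.xhr_params)

Collecting on the spider instance keeps each response's meta small while still accumulating parameters across all of the page's XHR traffic.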
Top GitHub Comments
Issue solved.
I have uploaded a working script implementing the results discussed here, in case others want to know how to mostly automate this with scrapy-playwright and limit human intervention in the web browser: script
That makes better sense, thank you for the clarification. I had not realised that page.on was not a coroutine when looking at the documentation; thanks for making that clear to me as well.

I have successfully collected the URLs as a result of your suggestion. Although, when you have a moment: where would the URL parameters get stored? If the response stores the resource types, surely the URL parameters must be stored somewhere as well, since we get a 200 response for the resource type. For example, the quotes URL should have the payload page: 1 stored when it sends requests from the XHR tab for the first page. I can grab it from the URL in this case, but in other cases the payload is not appended to the request URL.

My objective in wanting this information: it removes any human effort to grab payloads and request URLs, especially when an API is involved, so that I can scrape multiple sites directly without having to collect any of this information (or even without entering the browser, since everything is in the JSON).
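For what it's worth, here is a minimal sketch of where Playwright exposes that payload, assuming a handler registered for the "request" event (via page.on("request", ...) or scrapy-playwright's playwright_page_event_handlers meta key): the Request object carries post_data for bodies that are not appended to the URL.

from playwright.async_api import Request

async def handle_request(request: Request) -> None:
    # sketch: inspect each request the page makes
    if request.resource_type == "xhr":
        print(request.url)        # query-string payloads live in the URL
        print(request.post_data)  # raw POST body, or None when there is none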
Furthermore, when trying to grab the URLs I have noticed that not all the XHRs get a response (it stops at page 3):