
Grabbing parameters from default requests


scrapy-playwright imitates a real web browser, so it downloads all page resources (images, scripts, stylesheets, etc.). Given that this data is downloaded anyway, is it possible to grab the payload of specific requests that appear in the browser's network tab, specifically in the Fetch/XHR tab?

For example (minimal reproducible code):

import scrapy
from scrapy_playwright.page import PageCoroutine

class testSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta={
                "playwright": True,
                "playwright_page_coroutines": [
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                    PageCoroutine("screenshot", path="scroll.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        pass

Produces the following output:

2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=2> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=3> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=4> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
...

This is very similar to Firefox's network tab, which lets you copy a URL along with its parameters. The aim is to store the URL parameters in a list each time Playwright downloads a matching request URL. On a more complex website I would have to find the specific request URLs and grab their URL parameters.

Something like:

if response.meta['resource_type'] == 'xhr':
    print(parameters(response.meta['resource_type_urls']))

This is a pseudo-example to express what I want to get; parameters would be a function that extracts the URL parameters.
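For what it's worth, a `parameters` function like the one sketched above can be written with the standard library alone, assuming the payload is carried in the query string (the function name is the hypothetical one from the pseudo-example, not part of scrapy-playwright):

```python
from urllib.parse import urlparse, parse_qs


def parameters(url: str) -> dict:
    """Extract the query-string parameters of a URL as a dict of value lists."""
    return parse_qs(urlparse(url).query)


# e.g. one of the XHR URLs from the log output above:
print(parameters("http://quotes.toscrape.com/api/quotes?page=2"))  # {'page': ['2']}
```

`parse_qs` returns lists because a key may repeat in a query string; use `parse_qsl` instead if you want the raw key/value pairs.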

Or perhaps it works like this:

if response.meta['resource_type'] == 'xhr':
    print(response.meta['parameters'])

However, saving this into response.meta would likely bloat the results if there are many URLs per resource type, since the URL parameters can be fairly large dicts.

  • I’m convinced this data is available, since it is downloaded; I just do not know how to access it.
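One way to avoid stuffing everything into response.meta is to accumulate the captured URLs on the spider object itself. A minimal sketch of that pattern, using a plain Python class in place of the real Scrapy/Playwright types (`XhrCollector`, `record`, and `collected` are hypothetical names, not library API):

```python
from urllib.parse import urlparse, parse_qs


class XhrCollector:
    """Stand-in for a spider that records XHR request parameters as they arrive."""

    def __init__(self):
        self.collected = []  # grows as the page fires requests

    def record(self, resource_type: str, url: str) -> None:
        # Only keep background requests, mirroring the browser's Fetch/XHR tab.
        if resource_type in ("xhr", "fetch"):
            self.collected.append(
                {"url": url, "params": parse_qs(urlparse(url).query)}
            )


collector = XhrCollector()
collector.record("xhr", "http://quotes.toscrape.com/api/quotes?page=2")
collector.record("image", "http://quotes.toscrape.com/img/logo.png")  # ignored
print(collector.collected)
```

In a real spider the `record` logic would live inside the event handler, so the parameter dicts never pass through response.meta at all.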

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

3 reactions
lime-n commented, Feb 27, 2022

Issue solved.

I have uploaded a working script implementing the approach discussed here, in case others want to know how to mostly automate this with scrapy-playwright and limit manual work in the web browser.

script

0 reactions
lime-n commented, Feb 27, 2022

That makes more sense, thank you for the clarification. I had not realised from the docs that page.on is not a coroutine. Thanks for making that clear as well.

I have successfully collected the URLs thanks to your suggestion. When you have a moment, though: where are the URL parameters stored? If the response stores the resource type, surely the URL parameters must be stored somewhere too, since we get a 200 response for the resource. For example, the quotes URL should have the payload page=1 stored when the XHR requests the first page. I can grab it from the URL in this case, but on other sites the payload is not appended to the request URL.
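When the payload is not in the URL, Playwright exposes it separately: if I'm reading the Playwright docs right, its Request object carries the body in `post_data` (and `post_data_json`) alongside `url`. A sketch of a helper that checks both places, using a tiny stand-in dataclass instead of the real Playwright Request (`FakeRequest` and `extract_payload` are hypothetical names for illustration):

```python
from dataclasses import dataclass
from typing import Optional, Union
from urllib.parse import urlparse, parse_qs


@dataclass
class FakeRequest:
    """Stand-in for Playwright's Request, which exposes url and post_data."""
    url: str
    post_data: Optional[str] = None


def extract_payload(request) -> Union[dict, str, None]:
    """Return the query-string params if present, otherwise the request body."""
    params = parse_qs(urlparse(request.url).query)
    if params:
        return params
    return request.post_data


# GET with the payload in the URL:
print(extract_payload(FakeRequest("http://quotes.toscrape.com/api/quotes?page=1")))
# POST with the payload in the body instead:
print(extract_payload(FakeRequest("http://example.com/api", post_data='{"page": 1}')))
```

Inside a scrapy-playwright event handler the same check would run against `response.request`, which is the real Playwright Request object.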

My objectives for wanting this information:

  1. I can restrict the resource types, as suggested in issue #26.
  2. By handling requests/responses for only these resource types, I can grab their URLs using the event handlers.
  3. If possible, grab the URL parameters sent with each request for those resource types.

Why? It removes the human effort of grabbing payloads and request URLs, which is especially useful when an API is involved. I could then scrape multiple sites directly without collecting any of this information by hand (or even opening the browser, since everything is in the JSON).

Furthermore, while grabbing the URLs I noticed that not all the XHRs get a response (it stops at page 3):

2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Response: <200 http://quotes.toscrape.com/api/quotes?page=1> (referrer: None)
2022-02-27 10:14:02 [root] INFO: received response ('xhr', <Request url='http://quotes.toscrape.com/api/quotes?page=1' method='GET'>)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=2> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=3> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=4> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=5> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=6> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=7> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=8> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=9> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Response: <200 http://quotes.toscrape.com/api/quotes?page=2> (referrer: None)
2022-02-27 10:14:02 [root] INFO: received response ('xhr', <Request url='http://quotes.toscrape.com/api/quotes?page=2' method='GET'>)
2022-02-27 10:14:02 [scrapy-playwright] DEBUG: [Context=default] Response: <200 http://quotes.toscrape.com/api/quotes?page=3> (referrer: None)
2022-02-27 10:14:02 [root] INFO: received response ('xhr', <Request url='http://quotes.toscrape.com/api/quotes?page=3' method='GET'>)

import logging

from scrapy import Spider, Request
from scrapy_playwright.page import PageCoroutine
from playwright.async_api import Response as PlaywrightResponse


class EventSpider(Spider):
    name = "event"

    def start_requests(self):
        yield Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta=dict(
                playwright=True,
                playwright_page_coroutines=[
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
                playwright_page_event_handlers={
                    "response": "handle_response"
                },
            ),
        )

    async def handle_response(self, response: PlaywrightResponse) -> None:
        logging.info(f"received response {response.request.resource_type, response.request.url}")

    def parse(self, response):
        return {"url": response.url}
