Grabbing parameters from default requests
See original GitHub issue
scrapy_playwright tries to imitate the web browser, so it downloads all resources (images, scripts, stylesheets, etc.). Given that this information is downloaded, is it possible to grab the payload from specific requests shown in the network tab, specifically the Fetch/XHR tab?
For example (minimal reproducible code):
import scrapy
from scrapy_playwright.page import PageCoroutine

class testSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            cookies={"foo": "bar", "asdf": "qwerty"},
            meta={
                "playwright": True,
                "playwright_page_coroutines": [
                    PageCoroutine("wait_for_selector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                    PageCoroutine("screenshot", path="scroll.png", full_page=True),
                ],
            },
        )

    def parse(self, response):
        pass
Produces the following output:
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=2> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=3> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
2022-02-21 11:50:39 [scrapy-playwright] DEBUG: [Context=default] Request: <GET http://quotes.toscrape.com/api/quotes?page=4> (resource type: xhr, referrer: http://quotes.toscrape.com/scroll)
...
This is very similar to Firefox's network tab, which lets you copy the URL and its parameters. The aim is to store the URL parameters in a list each time Playwright downloads a specific request URL. On a more complex website I would have to find specific request URLs and grab their URL parameters.
Something like:
if response.meta['resource_type'] == 'xhr':
    print(parameters(response.meta['resource_type_urls']))
This is a pseudo-example to express what I want to get; parameters would be a function that grabs the URL parameters.
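For reference, a parameters helper like the hypothetical one above could be a thin wrapper around the standard library, assuming the payload is carried in the query string:

from urllib.parse import urlsplit, parse_qs

def parameters(url: str) -> dict:
    # hypothetical helper from the pseudo-example above:
    # decode the query string into a dict of lists
    return parse_qs(urlsplit(url).query)

print(parameters("http://quotes.toscrape.com/api/quotes?page=2"))
# {'page': ['2']}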
Or perhaps it works like this:
if response.meta['resource_type'] == 'xhr':
    print(response.meta['parameters'])
However, saving this into response.meta will likely bloat the results if I have a large number of URLs per resource type, since URL parameters are fairly large dicts.
- I’m convinced this data is available, since it is downloaded; I just do not know how to get it.
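A minimal sketch of one way to get it, assuming scrapy-playwright's playwright_page_event_handlers meta key (which can reference a spider coroutine by name) and collecting into a spider attribute rather than response.meta, to avoid the bloat mentioned above; the spider and attribute names are illustrative:

import scrapy
from urllib.parse import urlsplit, parse_qs

class XHRParamsSpider(scrapy.Spider):
    name = "xhr_params"  # hypothetical spider, for illustration only

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.xhr_params = []  # collected here, not in response.meta

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta={
                "playwright": True,
                "playwright_page_event_handlers": {
                    # fires for every request the page makes
                    "request": "handle_request",
                },
            },
        )

    async def handle_request(self, request) -> None:
        # `request` is a playwright.async_api.Request
        if request.resource_type == "xhr":
            self.xhr_params.append(parse_qs(urlsplit(request.url).query))

    def parse(self, response):
        self.logger.info("XHR parameters collected: %s", self.xhr_params)

Collecting on the spider instance keeps each response's meta small while still accumulating parameters across all of the page's XHR traffic.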
Top GitHub Comments
Issue solved.
I have uploaded a working script implementing the results discussed here, in case others want to know how to mostly automate this with scrapy-playwright and limit human intervention in the web browser: script
That makes better sense, thank you for the clarification. I had not realised that page.on was not a coroutine when looking at the documentation; thanks for making that clear to me as well.

I have successfully collected the URLs as a result of your suggestion. Although, when you have a moment: where would the URL parameters get stored? If the response stores the resource types, surely the URL parameters must be stored somewhere as well, since we get a 200 response for the resource type. For example, the quotes URL should have the payload page: 1 stored when it sends requests from the XHR tab for the first page. I can grab it from the URL in this case, but in other cases the payload is not appended to the request URL.

My objective in wanting this information: it removes any human effort to grab payloads and request URLs, especially when an API is involved, so that I can scrape multiple sites directly without having to collect any of this information (or even without entering the browser, since everything is in the JSON).
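For what it's worth, here is a minimal sketch of where Playwright exposes that payload, assuming a handler registered for the "request" event (via page.on("request", ...) or scrapy-playwright's playwright_page_event_handlers meta key): the Request object carries post_data for bodies that are not appended to the URL.

from playwright.async_api import Request

async def handle_request(request: Request) -> None:
    # sketch: inspect each request the page makes
    if request.resource_type == "xhr":
        print(request.url)        # query-string payloads live in the URL
        print(request.post_data)  # raw POST body, or None when there is none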
Furthermore, when trying to grab the URLs I have noticed that not all the XHRs get a response (it stops at page 3):