Example when using LinkExtractor
I’m trying to come up with a simple spider that takes a screenshot of every page. In the example below, parse_item() is called for every link, but take_screenshot() is never called at all. Perhaps I missed something in the documentation?
import hashlib
import logging

from scrapy.http import Response, Request
from scrapy.link import Link
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy_playwright.page import PageMethod

DEBUG = False
logging.getLogger("scrapy").propagate = DEBUG
logging.getLogger("filelock").propagate = False


class AwesomeSpider(CrawlSpider):
    name = "page"

    custom_meta = {
        "playwright": True,
        "playwright_context": 1,
        "playwright_include_page": True,
        "playwright_page_methods": [
            PageMethod("wait_for_load_state", "networkidle"),
        ],
    }

    def __init__(self, **kw):
        self.rules = (
            Rule(
                LinkExtractor(),
                process_links="process_links",
                callback="parse_item",
                follow=True,
            ),
        )
        super().__init__(**kw)

    def process_links(self, links: list[Link]) -> list[Link]:
        return links

    def parse_item(self, response: Response):
        yield Request(
            response.url,
            callback=self.take_screenshot,
            meta=self.custom_meta,
        )

    async def take_screenshot(self, response):
        print(response.url)
        page = response.meta.get("playwright_page")
        if page:
            url_sha256 = hashlib.sha256(response.url.encode("utf-8")).hexdigest()
            await page.screenshot(path=f"{url_sha256}.png", full_page=True)
            title = await page.title()
            await page.close()
            return {"url": response.url, "title": title}

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,
            "CLOSESPIDER_ITEMCOUNT": 100,
            "TELNETCONSOLE_ENABLED": False,
            "FEEDS": {
                "data.json": {
                    "format": "json",
                    "encoding": "utf-8",
                    "indent": 4,
                },
            },
        }
    )
    process.crawl(
        AwesomeSpider,
        allowed_domains=["toscrape.com"],
        start_urls=["https://books.toscrape.com/"],
    )
    logging.getLogger("scrapy.core.engine").setLevel(logging.WARNING)
    logging.getLogger("scrapy.core.scraper").setLevel(logging.WARNING)
    process.start()
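One plausible reason take_screenshot() never runs (an assumption, not something confirmed in this thread): the Request yielded from parse_item() targets response.url, a URL the crawl has just visited, so Scrapy’s default duplicate filter drops it before it reaches the downloader. A minimal sketch of that idea, marking the second request dont_filter=True:

    def parse_item(self, response: Response):
        # The rule already fetched response.url, so a second Request to the
        # same URL is normally discarded by the scheduler's dupefilter.
        # dont_filter=True asks the scheduler to accept it anyway.
        yield Request(
            response.url,
            callback=self.take_screenshot,
            meta=self.custom_meta,
            errback=self.errback,
            dont_filter=True,
        )

Alternatively, the playwright meta could be attached through the Rule’s process_request hook so the rule’s own request is rendered by Playwright and the extra round trip is avoided; that variant is not shown here.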
Top GitHub Comments
The Page.screenshot function’s argument is full_page, not fullPage.

All questions have been answered and there hasn’t been new activity in more than a month, closing.
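For reference, a minimal sketch of the call that comment describes, assuming the async Python API (the save_full_page helper name is only for illustration):

from playwright.async_api import Page

async def save_full_page(page: Page, path: str) -> None:
    # The Python binding uses snake_case keyword arguments;
    # full_page=True captures the whole scrollable page.
    await page.screenshot(path=path, full_page=True)
    # "fullPage" (camelCase) is the Node.js spelling and is not accepted
    # by the Python API.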