Example when using LinkExtractor
I’m trying to come up with a simple spider that takes a screenshot of every page. In the example below, parse_item() is called for every link, but take_screenshot() is never called at all. Perhaps I missed something in the documentation?
import hashlib
import logging

from scrapy.http import Response, Request
from scrapy.link import Link
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy_playwright.page import PageMethod

DEBUG = False
logging.getLogger("scrapy").propagate = DEBUG
logging.getLogger("filelock").propagate = False


class AwesomeSpider(CrawlSpider):
    name = "page"

    custom_meta = {
        "playwright": True,
        "playwright_context": 1,
        "playwright_include_page": True,
        "playwright_page_methods": [
            PageMethod("wait_for_load_state", "networkidle"),
        ],
    }

    def __init__(self, **kw):
        self.rules = (
            Rule(
                LinkExtractor(),
                process_links="process_links",
                callback="parse_item",
                follow=True,
            ),
        )
        super().__init__(**kw)

    def process_links(self, links: list[Link]) -> list[Link]:
        return links

    def parse_item(self, response: Response):
        yield Request(
            response.url,
            callback=self.take_screenshot,
            meta=self.custom_meta,
        )

    async def take_screenshot(self, response):
        print(response.url)
        page = response.meta.get("playwright_page")
        if page:
            url_sha256 = hashlib.sha256(response.url.encode("utf-8")).hexdigest()
            await page.screenshot(path=f"{url_sha256}.png", full_page=True)
            title = await page.title()
            await page.close()
            return {"url": response.url, "title": title}

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,
            "CLOSESPIDER_ITEMCOUNT": 100,
            "TELNETCONSOLE_ENABLED": False,
            "FEEDS": {
                "data.json": {
                    "format": "json",
                    "encoding": "utf-8",
                    "indent": 4,
                },
            },
        }
    )
    process.crawl(
        AwesomeSpider,
        allowed_domains=["toscrape.com"],
        start_urls=["https://books.toscrape.com/"],
    )
    logging.getLogger("scrapy.core.engine").setLevel(logging.WARNING)
    logging.getLogger("scrapy.core.scraper").setLevel(logging.WARNING)
    process.start()
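One plausible reason take_screenshot() never runs (an assumption, not something confirmed in this thread): the Request yielded from parse_item() targets response.url, a URL the crawl has just visited, so Scrapy’s default duplicate filter drops it before it reaches the downloader. A minimal sketch of that idea, marking the second request dont_filter=True:

    def parse_item(self, response: Response):
        # The rule already fetched response.url, so a second Request to the
        # same URL is normally discarded by the scheduler's dupefilter.
        # dont_filter=True asks the scheduler to accept it anyway.
        yield Request(
            response.url,
            callback=self.take_screenshot,
            meta=self.custom_meta,
            errback=self.errback,
            dont_filter=True,
        )

Alternatively, the playwright meta could be attached through the Rule’s process_request hook so the rule’s own request is rendered by Playwright and the extra round trip is avoided; that variant is not shown here.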
Top GitHub Comments
The Page.screenshot function’s argument is full_page, not fullPage.

All questions have been answered and there hasn’t been new activity in more than a month, closing.
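For reference, a minimal sketch of the call that comment describes, assuming the async Python API (the save_full_page helper name is only for illustration):

from playwright.async_api import Page

async def save_full_page(page: Page, path: str) -> None:
    # The Python binding uses snake_case keyword arguments;
    # full_page=True captures the whole scrollable page.
    await page.screenshot(path=path, full_page=True)
    # "fullPage" (camelCase) is the Node.js spelling and is not accepted
    # by the Python API.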