Example when using LinkExtractor

See original GitHub issue

I’m trying to come up with a simple spider that takes a screenshot of every page.

In the example below, parse_item() is called for every link, but take_screenshot() is never called. Perhaps I missed something in the documentation?

import hashlib
import logging

from scrapy.http import Response, Request
from scrapy.link import Link
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor

from scrapy_playwright.page import PageMethod

DEBUG = False
logging.getLogger("scrapy").propagate = DEBUG
logging.getLogger("filelock").propagate = False


class AwesomeSpider(CrawlSpider):
    name = "page"

    custom_meta = {
        "playwright": True,  # route the request through Playwright
        "playwright_context": 1,  # which browser context to use
        "playwright_include_page": True,  # expose the page in response.meta
        "playwright_page_methods": [
            # wait until there are no network connections for 500 ms
            PageMethod("wait_for_load_state", "networkidle"),
        ],
    }

    def __init__(self, **kw):
        # Rules must be assigned before super().__init__(), which compiles them.
        self.rules = (
            Rule(
                LinkExtractor(),
                process_links="process_links",
                callback="parse_item",
                follow=True,
            ),
        )

        super().__init__(**kw)

    def process_links(self, links: list[Link]) -> list[Link]:
        return links

    def parse_item(self, response: Response):
        # Re-request the same URL, this time through Playwright, so a real
        # browser page is available for the screenshot.
        yield Request(
            response.url,
            callback=self.take_screenshot,
            errback=self.errback,  # close the page if the request fails
            meta=self.custom_meta,
        )

    async def take_screenshot(self, response):
        print(response.url)
        page = response.meta.get("playwright_page")
        if page:
            url_sha256 = hashlib.sha256(response.url.encode("utf-8")).hexdigest()
            await page.screenshot(path=f"{url_sha256}.png", full_page=True)
            title = await page.title()
            await page.close()
            return {"url": response.url, "title": title}

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()


if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,
            "CLOSESPIDER_ITEMCOUNT": 100,
            "TELNETCONSOLE_ENABLED": False,
            "FEEDS": {
                "data.json": {
                    "format": "json",
                    "encoding": "utf-8",
                    "indent": 4,
                },
            },
        }
    )

    process.crawl(
        AwesomeSpider,
        allowed_domains=["toscrape.com"],
        start_urls=["https://books.toscrape.com/"],
    )
    logging.getLogger("scrapy.core.engine").setLevel(logging.WARNING)
    logging.getLogger("scrapy.core.scraper").setLevel(logging.WARNING)
    process.start()
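
One plausible reading of the failure (an assumption on my part; the thread does not confirm it): the Request yielded from parse_item() points at a URL the scheduler has already fetched, so Scrapy's default duplicate filter silently drops it and take_screenshot() never runs. Below is a minimal sketch of two workarounds for AwesomeSpider, using only standard Scrapy APIs: dont_filter on Request, and the process_request hook on Rule (available since Scrapy 2.4). The use_playwright helper name is illustrative.

    # Sketch 1: bypass the duplicate filter on the re-request.
    def parse_item(self, response: Response):
        yield Request(
            response.url,
            callback=self.take_screenshot,
            errback=self.errback,
            meta=self.custom_meta,
            dont_filter=True,  # the URL was already fetched once by the rule
        )

    # Sketch 2: skip the second request entirely. Attach the Playwright
    # meta to the rule's own requests and point the rule's callback
    # straight at take_screenshot.
    def __init__(self, **kw):
        self.rules = (
            Rule(
                LinkExtractor(),
                process_request="use_playwright",
                callback="take_screenshot",
                follow=True,
            ),
        )
        super().__init__(**kw)

    def use_playwright(self, request: Request, response: Response) -> Request:
        request.meta.update(self.custom_meta)
        return request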

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
sucream commented, Jun 24, 2022

The Page.screenshot function’s argument is full_page, not fullPage:

PageMethod("screenshot", path=f"{url_sha256}.png", full_page=True),
0 reactions
elacuesta commented, Jul 28, 2022

All questions have been answered and there hasn’t been new activity in more than a month, closing.
