Extend Jobs Functionality to allow for printf-style path configuration like Feed URIs
Summary
For the JOBDIR setting, the documentation states that each spider should use its own directory, but there is currently nothing in place to handle this automatically, as there is for feed URIs (through the use of printf-style path strings).
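For comparison, feed URIs already expand printf-style placeholders such as %(name)s and %(time)s. The sketch below contrasts that existing behavior with the proposed (hypothetical, not yet implemented) equivalent for JOBDIR; the expansion itself is plain printf-style string formatting:

```python
# settings.py
# Feed URIs already expand printf-style placeholders (works today):
FEEDS = {
    "exports/%(name)s/%(time)s.json": {"format": "json"},
}

# Proposed (hypothetical) equivalent for JOBDIR -- %(name)s would be
# replaced with each spider's name, giving every spider its own directory:
JOBDIR = "crawls/%(name)s"

# The expansion itself is ordinary printf-style string formatting:
expanded = JOBDIR % {"name": "quotes"}  # -> "crawls/quotes"
```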
Motivation
It was suggested by Scrapy’s maintainer on Reddit (post link) that I submit this feature request because, as they said:
We support something like this in feed URIs, it should not be too hard to support it in JOBDIR
Describe alternatives you’ve considered
I tried to write my own implementation of the SpiderState extension by subclassing the existing one, and it seems to work, but there is one hang-up. When using Scrapy’s shell to explore pages, I constantly run into errors with this implementation, stating struct.error: unpack requires a buffer of 4 bytes. I believe this is caused by the extension attempting to unpickle an empty or non-existent .state file.
Code of self-made extension
```python
from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState
import os


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each
        spider based on the spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each
        spider.custom_settings dict, simply specify the base JOBDIR in
        settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)
        obj = cls(spider_jobdir)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj
```
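To try this subclass yourself, you would disable the built-in extension and register the replacement in settings.py. A sketch, assuming the class lives in a module such as myproject.extensions (adjust the path to your project):

```python
# settings.py (sketch -- the module path "myproject.extensions" is a placeholder)
JOBDIR = "crawls"  # base directory; per-spider subdirectories are created under it

EXTENSIONS = {
    "scrapy.extensions.spiderstate.SpiderState": None,  # disable the stock extension
    "myproject.extensions.SpiderStateManager": 0,       # enable the subclass
}
```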
Additional context
I’m not sure whether this is an issue with the way SpiderState operates in general, but it was at least for my implementation. When using scrapy shell "url.com" I would receive errors that were only resolved by repeatedly deleting the JOBDIR directory and re-running the shell command. Can this update include a fix for such behavior, if it is indeed an issue with the original SpiderState class?
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 10
Top GitHub Comments
Any idea how much effort it would take to implement this correctly? My current attempt (shared within the original request) does not function properly, as I seem to be missing a core aspect somewhere. For some reason the requests.queue folder and requests.seen file still end up being created within the base JOBDIR, so I currently do not have a way to keep track of multiple spiders at one time. I’m just waiting until someone can manage to implement this feature.
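A likely reason the requests.queue folder and requests.seen file still land in the base JOBDIR is that Scrapy's scheduler and duplicate filter read the JOBDIR setting directly, independently of the SpiderState extension, so replacing only the extension is not enough: the setting itself has to point at the per-spider directory. One hedged workaround sketch is to rewrite JOBDIR via the spider's update_settings hook; the mixin below is my own, not part of Scrapy, and assumes Scrapy's classmethod Spider.update_settings(cls, settings):

```python
import os


class PerSpiderJobdirMixin:
    """Sketch of a workaround (not part of Scrapy): rewrite JOBDIR to a
    per-spider subdirectory before the crawler components read it, so the
    scheduler's requests.queue and the dupefilter's requests.seen also
    land under the spider-specific path."""

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        base = settings.get("JOBDIR")
        if base:
            # Append the spider's name to the configured base directory.
            settings.set("JOBDIR", os.path.join(base, cls.name),
                         priority="spider")
```

A spider would then inherit from the mixin alongside scrapy.Spider, and the base JOBDIR set in settings.py would be expanded per spider automatically.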