
Extend Jobs Functionality to allow for printf-style path configuration like Feed URIs

See original GitHub issue

Summary

The documentation for the JOBDIR setting states that each spider should use its own directory, but there is currently nothing in place to handle this automatically, as there is for feed URIs (through printf-style path strings).
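
For reference, feed URIs already accept printf-style parameters such as %(name)s and %(time)s, which Scrapy fills in per spider; the request here is for JOBDIR to get the same treatment. A minimal sketch of the existing behavior next to the proposed analogue (the JOBDIR line is the proposal, not a current Scrapy feature):

# settings.py
# Existing behavior: Scrapy expands %(name)s / %(time)s in feed URIs.
FEED_URI = 'exports/%(name)s/%(time)s.json'

# Proposed (hypothetical): the same expansion applied to JOBDIR, so each
# spider automatically persists its state in its own directory.
JOBDIR = 'jobs/%(name)s'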

Motivation

It was suggested by a Scrapy maintainer on Reddit (post link) that I submit this feature request because, as they said:

We support something like this in feed URIs, it should not be too hard to support it in JOBDIR

Describe alternatives you’ve considered

I tried to write my own implementation of the SpiderState extension by subclassing the existing one, and it seems to work, but there is one hang-up: when using Scrapy’s shell to explore pages I constantly run into errors from this implementation, stating struct.error: unpack requires a buffer of 4 bytes. I believe the error comes from the extension attempting to unpickle an empty or non-existent .state file.
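
If that diagnosis is right, a guard around the unpickling step would avoid the crash. A minimal sketch, assuming the failure really is a missing or empty .state file (load_state is a hypothetical helper, not part of Scrapy’s API):

import os
import pickle


def load_state(path):
    # Skip unpickling when the .state file is missing or empty; loading
    # such a file is what appears to raise EOFError / struct.error.
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return {}
    with open(path, 'rb') as f:
        return pickle.load(f)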

Code of the self-made extension
import os

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState


class SpiderStateManager(SpiderState):
    """
    SpiderState purpose: store and load spider state during a scraping job.
    Added purpose: create a unique subdirectory within JOBDIR for each spider,
        based on the spider.name property.
    Reasoning: reduces repetitive code. Instead of adding subdirectory paths to
        each spider's custom_settings dict, simply specify the base JOBDIR in
        settings.py and the subdirectories are managed automatically.
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super().__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        # Give each spider its own subdirectory under the base JOBDIR.
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        os.makedirs(spider_jobdir, exist_ok=True)

        obj = cls(spider_jobdir)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj
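
For completeness, wiring the subclass in would look roughly like this, replacing the built-in extension; myproject.extensions is a placeholder for wherever the class actually lives:

# settings.py
EXTENSIONS = {
    # Disable the built-in extension so the subclass takes over.
    'scrapy.extensions.spiderstate.SpiderState': None,
    # Placeholder module path; adjust to the project layout.
    'myproject.extensions.SpiderStateManager': 0,
}
JOBDIR = 'jobs'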

Additional context

I’m not sure whether this is an issue with the way SpiderState operates in general, but it was at least for my implementation. When using scrapy shell "url.com" I would receive errors that were only resolved by repeatedly deleting the JOBDIR directory and re-running the shell command. Can this update include a fix for such behavior, if it is indeed an issue with the original SpiderState class?

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 10

Top GitHub Comments

1 reaction
caffeinatedMike commented, Aug 31, 2020

Any idea how much effort it would take to implement this correctly? My current attempt (shared in the original request) does not function properly; I seem to be missing a core piece somewhere. For some reason the requests.queue folder and requests.seen file still end up being created in the base JOBDIR, so I currently have no way to keep track of multiple spiders at one time.
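
A likely explanation, worth checking against the Scrapy source: the scheduler and the duplicate filter never consult the SpiderState extension. They resolve their on-disk paths (requests.queue, requests.seen) straight from the raw JOBDIR setting, via a helper that looks roughly like this, so a SpiderState subclass only relocates the spider.state file:

# Approximately scrapy/utils/job.py's job_dir() helper; the scheduler
# and dupefilter call it with the raw settings object directly.
import os


def job_dir(settings):
    path = settings['JOBDIR']
    if path and not os.path.exists(path):
        os.makedirs(path)
    return path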

0 reactions
caffeinatedMike commented, Nov 9, 2020

I’m just waiting until someone can manage to implement this feature.
