Extend Jobs Functionality to allow for printf-style path configuration like Feed URIs
Summary
For the JOBDIR setting, the documentation states that each spider should use its own directory, but there is currently nothing in place to handle this automatically, as there is for feed URIs (through the use of printf-style path strings).
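For comparison, feed URIs already expand printf-style placeholders such as %(name)s and %(time)s. The sketch below contrasts that existing behavior with the proposed (hypothetical, not yet implemented) equivalent for JOBDIR; the expansion itself is plain printf-style string formatting:

```python
# settings.py
# Feed URIs already expand printf-style placeholders (works today):
FEEDS = {
    "exports/%(name)s/%(time)s.json": {"format": "json"},
}

# Proposed (hypothetical) equivalent for JOBDIR -- %(name)s would be
# replaced with each spider's name, giving every spider its own directory:
JOBDIR = "crawls/%(name)s"

# The expansion itself is ordinary printf-style string formatting:
expanded = JOBDIR % {"name": "quotes"}  # -> "crawls/quotes"
```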
Motivation
It was suggested by Scrapy’s maintainer on Reddit (post link) that I submit this feature request because, as they said:
We support something like this in feed URIs, it should not be too hard to support it in JOBDIR
Describe alternatives you’ve considered
I tried to write my own implementation of the SpiderState extension by subclassing the existing one, and it seems to work, but there is one hang-up. When using Scrapy’s shell to explore pages, I constantly run into errors with this implementation, stating struct.error: unpack requires a buffer of 4 bytes. I believe this is caused by the extension attempting to unpickle an empty or non-existent .state file.
Code of self-made extension
```python
from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState
import os


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each
        spider based on the spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each
        spider.custom_settings dict, simply specify the base JOBDIR in
        settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)
        obj = cls(spider_jobdir)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj
```
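To try this subclass yourself, you would disable the built-in extension and register the replacement in settings.py. A sketch, assuming the class lives in a module such as myproject.extensions (adjust the path to your project):

```python
# settings.py (sketch -- the module path "myproject.extensions" is a placeholder)
JOBDIR = "crawls"  # base directory; per-spider subdirectories are created under it

EXTENSIONS = {
    "scrapy.extensions.spiderstate.SpiderState": None,  # disable the stock extension
    "myproject.extensions.SpiderStateManager": 0,       # enable the subclass
}
```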
Additional context
I’m not sure whether this is an issue with the way SpiderState operates in general, but it was at least for my implementation. When using scrapy shell "url.com" I would receive errors that were only resolved by repeatedly deleting the JOBDIR directory and re-running the shell command. Can this update include a fix for such behavior, if it is indeed an issue with the original SpiderState class?
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 10
Top GitHub Comments
Any idea how much effort it would take to implement this correctly? My current attempt (shared within the original request) does not function properly, as I seem to be missing a core aspect somewhere. For some reason the requests.queue folder and requests.seen file still end up being created within the base JOBDIR, so I currently do not have a way to keep track of multiple spiders at one time. I’m just waiting until someone can manage to implement this feature.
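A likely reason the requests.queue folder and requests.seen file still land in the base JOBDIR is that Scrapy's scheduler and duplicate filter read the JOBDIR setting directly, independently of the SpiderState extension, so replacing only the extension is not enough: the setting itself has to point at the per-spider directory. One hedged workaround sketch is to rewrite JOBDIR via the spider's update_settings hook; the mixin below is my own, not part of Scrapy, and assumes Scrapy's classmethod Spider.update_settings(cls, settings):

```python
import os


class PerSpiderJobdirMixin:
    """Sketch of a workaround (not part of Scrapy): rewrite JOBDIR to a
    per-spider subdirectory before the crawler components read it, so the
    scheduler's requests.queue and the dupefilter's requests.seen also
    land under the spider-specific path."""

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        base = settings.get("JOBDIR")
        if base:
            # Append the spider's name to the configured base directory.
            settings.set("JOBDIR", os.path.join(base, cls.name),
                         priority="spider")
```

A spider would then inherit from the mixin alongside scrapy.Spider, and the base JOBDIR set in settings.py would be expanded per spider automatically.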