[Question] Large number of sub-directories in `requests.queue`
Motivation
I am currently running a broad crawl on ~3 million starting URLs using the suggested settings from this page. Since pause and resume features are necessary for my crawl, I used the `JOBDIR` project setting. This has been working great so far.
Recently I observed that the number of sub-directories in the `requests.queue` directory in the `JOBDIR` keeps increasing. For example, I currently have ~800,000 sub-directories in `requests.queue`. I am concerned that continued growth of this directory might exhaust the inodes on my (relatively small) server's partition.
Question
Is there any way to have `scrapy` automatically prune the `requests.queue` directory once the requests in this directory are complete?
PS: I am not well versed in how `queuelib` works in `scrapy`, so my apologies if this is a naive question.
Additional information
Upon closer inspection, I observed that the majority of sub-directories in `requests.queue` are either empty or contain only empty files with a `q*` prefix, as also described in #4842. I am guessing that these correspond to completed requests. If this is correct, then these stale directories/files could easily be detected and removed.
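The clean-up described above could be sketched as a small standalone script. This is only an illustration under the assumption stated in the question, namely that a sub-directory which is empty or holds only zero-byte `q*` files no longer carries pending requests; that assumption is not confirmed by the maintainers. The function names `is_stale` and `prune_queue_dir` are hypothetical, and the script should only be run while the crawl is stopped. It defaults to a dry run that lists candidates without deleting anything.

```python
import os


def is_stale(dirpath):
    """Return True if dirpath is empty or holds only zero-byte q* files.

    ASSUMPTION (from the question, unconfirmed): such directories
    correspond to completed requests.
    """
    with os.scandir(dirpath) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                return False  # nested directory or special file: keep
            if not entry.name.startswith("q"):
                return False  # e.g. a metadata file: keep
            if entry.stat(follow_symlinks=False).st_size > 0:
                return False  # chunk file with data: keep
    return True


def prune_queue_dir(root, dry_run=True):
    """List (and, if dry_run is False, delete) stale sub-directories of root."""
    with os.scandir(root) as entries:
        subdirs = sorted(
            e.path for e in entries if e.is_dir(follow_symlinks=False)
        )
    stale = []
    for path in subdirs:
        if not is_stale(path):
            continue
        stale.append(path)
        if not dry_run:
            # Remove the zero-byte files first, then the directory itself.
            for name in os.listdir(path):
                os.remove(os.path.join(path, name))
            os.rmdir(path)
    return stale
```

A call such as `prune_queue_dir("crawls/job-1/requests.queue")` (the path is a placeholder) returns the candidate directories; passing `dry_run=False` deletes them.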
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (2 by maintainers)
Top GitHub Comments
To do what @atreyasha wants, which I think is for Scrapy to only keep those folders while they have content, we would need to go deeper than changing the close method. Those queues should not just create the folders when the queue is created and then remove them; they should create them while they have enqueued requests, remove them when they run out of requests, and then re-create them if they get more requests.
It’s probably not trivial to do, but it should be possible to implement. We can modify the existing queue classes or create new ones if the change is expected to have any significant performance hit (my guess would be no).
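To make the lifecycle described in this comment concrete, here is a minimal sketch of a wrapper that creates the underlying queue (and thus its folder) on the first push, removes the folder when the queue drains, and re-creates it on a later push. It is not Scrapy or queuelib code: `LazyDirQueue` and `ToyDirQueue` are hypothetical names, and `ToyDirQueue` is an in-memory stand-in that merely creates a directory on construction, as the disk queues are described to do.

```python
import os
import shutil
from collections import deque


class ToyDirQueue:
    """Hypothetical stand-in for a disk queue: creates `path` when built,
    but keeps its items in memory to keep the sketch short."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)
        self._items = deque()

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.popleft() if self._items else None

    def close(self):
        pass

    def __len__(self):
        return len(self._items)


class LazyDirQueue:
    """Keep the wrapped queue and its directory only while items are enqueued."""

    def __init__(self, factory, path):
        self._factory = factory
        self.path = path
        # Reopen eagerly if an earlier run left the directory behind,
        # because it may still hold pending requests.
        self._queue = factory(path) if os.path.isdir(path) else None

    def push(self, item):
        if self._queue is None:
            self._queue = self._factory(self.path)  # directory appears here
        self._queue.push(item)

    def pop(self):
        if self._queue is None:
            return None
        item = self._queue.pop()
        if len(self._queue) == 0:
            self._release()  # drained: drop the directory
        return item

    def close(self):
        if self._queue is None:
            return
        if len(self._queue) == 0:
            self._release()
        else:
            self._queue.close()  # keep the directory for a later resume
            self._queue = None

    def __len__(self):
        return 0 if self._queue is None else len(self._queue)

    def _release(self):
        self._queue.close()
        self._queue = None
        shutil.rmtree(self.path, ignore_errors=True)
```

The extra cost is one directory creation and removal per fill/drain cycle, which is the performance question raised in the comment above.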
In `squeues.py` I found `with_mkdir(queue.FifoDiskQueue)`. Does this create the directory? `queuelib/queue.py` also calls `os.makedirs(path)`, which looks like it creates a directory as well.
`queue.FifoDiskQueue` has a `close` method. Perhaps we could create `FifoDiskQueue` and `LifoDiskQueue` subclasses that override the `close` method to also remove the `self.path` directory?
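The subclassing idea from this comment could look roughly like the mixin below. It is a sketch only: `RemoveEmptyDirOnClose`, `StubDirQueue`, and `CleaningStubQueue` are hypothetical names, and the stub base class stands in for a real disk queue so that the example runs without queuelib installed. The mixin assumes the base class exposes `path` and `close()`, and it removes `self.path` only when that path is a directory with nothing left in it.

```python
import os


class RemoveEmptyDirOnClose:
    """Mixin: run the normal close(), then delete self.path if it is an
    empty directory. Non-empty directories are left untouched."""

    def close(self):
        super().close()
        path = self.path
        if os.path.isdir(path) and not os.listdir(path):
            os.rmdir(path)


class StubDirQueue:
    """Hypothetical base class used only for demonstration: it creates
    `path` on construction and records that close() was called."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)
        self.closed = False

    def close(self):
        self.closed = True


class CleaningStubQueue(RemoveEmptyDirOnClose, StubDirQueue):
    pass


# Untested assumption: the same mixin might be combined with queuelib, e.g.
#   from queuelib import queue
#   class CleaningFifoDiskQueue(RemoveEmptyDirOnClose, queue.FifoDiskQueue):
#       pass
# Whether self.path is a directory or a file depends on the queue class.
```

As the maintainer comment above notes, cleaning up in `close` only helps at shutdown; it does not remove folders while the crawl is still running.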