[Question] Large number of sub-directories in `requests.queue`

Motivation

I am currently running a broad crawl on ~3 million starting URLs using the suggested settings from this page. Since pause and resume features are necessary for my crawl, I used the JOBDIR project setting. This has been working great so far.

Recently I observed that the number of sub-directories in the requests.queue directory inside the JOBDIR keeps increasing. For example, I currently have ~800,000 sub-directories in requests.queue. I am concerned that this continued growth might exhaust the inodes on my (relatively small) server's partition.

Question

Is there any way to have Scrapy automatically prune the requests.queue directory once the requests stored in it are complete?

PS: I am not well versed in how queuelib works in Scrapy, so my apologies if this is a naive question.

Additional information

Upon closer inspection, I observed that the majority of sub-directories in requests.queue are either empty or contain only empty files with a q* prefix, as also described in #4842. I am guessing that these correspond to completed requests. If this is correct, then these stale directories/files could easily be found and removed.
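
As a stopgap, the clean-up described above could be sketched as a small standalone script. This is only a sketch under assumptions: the function names `is_stale` and `prune_stale_dirs` are made up for illustration, it only looks one level below requests.queue, and it assumes that an empty directory (or one holding only zero-byte files) really is stale. It should probably only be run while the crawl is paused, and with `dry_run=True` first.

```python
from pathlib import Path
from typing import List


def is_stale(directory: Path) -> bool:
    """Assumption: a directory is stale if it is empty or holds only zero-byte files."""
    for entry in directory.iterdir():
        # Any sub-directory or any file with content means the queue may be live.
        if not entry.is_file() or entry.stat().st_size > 0:
            return False
    return True


def prune_stale_dirs(queue_dir: str, dry_run: bool = True) -> List[str]:
    """Return the names of stale sub-directories; delete them unless dry_run is set."""
    stale = []
    for sub in sorted(Path(queue_dir).iterdir()):
        if sub.is_dir() and is_stale(sub):
            if not dry_run:
                for empty_file in sub.iterdir():
                    empty_file.unlink()  # remove the zero-byte files first
                sub.rmdir()
            stale.append(sub.name)
    return stale


if __name__ == "__main__":
    # Hypothetical JOBDIR layout; adjust the path to your own crawl.
    target = Path("crawls/my-job/requests.queue")
    if target.is_dir():
        print(len(prune_stale_dirs(str(target), dry_run=True)), "stale directories found")
```

The dry run only reports what would be removed, which makes it possible to sanity-check the guess about stale directories before deleting anything.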

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

Gallaecio commented, Oct 13, 2021

To do what @atreyasha wants, which I think is for Scrapy to only keep those folders while they have content, we would need to go deeper than changing the close method. Those queues should not just create the folders when the queue is created and then remove them; they should create them while they have enqueued requests, remove them when they run out of requests, and then re-create them if they get more requests.

It’s probably not trivial to do, but it should be possible to implement. We can modify the existing queue classes or create new ones if the change is expected to have any significant performance hit (my guess would be no).
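
To make the create/remove/re-create idea concrete, here is a minimal sketch of a wrapper that only keeps the folder while the queue has content. It is not Scrapy's or queuelib's implementation: `LazyDirQueue` and `queue_factory` are hypothetical names, and the sketch assumes the wrapped queue takes a directory path, exposes `push`, `pop`, `close` and `__len__`, and leaves its directory empty once it is closed with no items.

```python
import os
from typing import Any, Callable, Optional


class LazyDirQueue:
    """Hypothetical wrapper: the directory exists only while requests are enqueued."""

    def __init__(self, path: str, queue_factory: Callable[[str], Any]) -> None:
        self.path = path
        self._factory = queue_factory
        self._queue: Optional[Any] = None
        # When resuming, re-open a queue that still has files on disk.
        if os.path.isdir(path) and os.listdir(path):
            self._queue = queue_factory(path)

    def push(self, item: Any) -> None:
        if self._queue is None:
            # (Re-)create the folder only when there is something to store.
            os.makedirs(self.path, exist_ok=True)
            self._queue = self._factory(self.path)
        self._queue.push(item)

    def pop(self) -> Any:
        if self._queue is None:
            return None
        item = self._queue.pop()
        if len(self._queue) == 0:
            self.close()  # the queue ran out of requests: drop the folder
        return item

    def close(self) -> None:
        if self._queue is not None:
            self._queue.close()
            self._queue = None
        try:
            os.rmdir(self.path)  # only succeeds if the directory is empty
        except OSError:
            pass  # not empty, or already removed by the wrapped queue

    def __len__(self) -> int:
        return 0 if self._queue is None else len(self._queue)
```

Closing and re-opening the inner queue on every empty/non-empty transition costs extra file system calls, so whether this has a measurable performance hit would need to be benchmarked on a real crawl.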

LeMoussel commented, Oct 13, 2021

In squeues.py I found this:

_PickleFifoSerializationDiskQueue = _serializable_queue(
    _with_mkdir(queue.FifoDiskQueue),
    _pickle_serialize,
    pickle.loads
)

=> _with_mkdir(queue.FifoDiskQueue) -> creates the directory?

But also, in queuelib/queue.py:

    def __init__(self, path: str, chunksize: int = 100000) -> None:
        self.path = path
        if not os.path.exists(path):
            os.makedirs(path)

=> os.makedirs(path) -> creates the directory?

queue.FifoDiskQueue has a close method.

Perhaps we could create FifoDiskQueue and LifoDiskQueue subclasses that override the close method to also remove the directory at self.path?
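
A rough sketch of that idea, written as a mixin so it does not depend on queuelib being importable. `RemoveEmptyDirOnCloseMixin` is a hypothetical name, and the sketch assumes the queue class stores a directory in `self.path` (as the FifoDiskQueue `__init__` quoted above does). It only removes the directory if nothing is left in it after the parent `close()` has run.

```python
import os


class RemoveEmptyDirOnCloseMixin:
    """Hypothetical mixin: after closing, remove self.path if it is an empty directory."""

    def close(self) -> None:
        super().close()  # let the real queue flush and clean up its own files
        path = getattr(self, "path", None)
        if path and os.path.isdir(path) and not os.listdir(path):
            os.rmdir(path)


# Possible use with queuelib (assumption, not tested against Scrapy here):
#
#     from queuelib import queue
#
#     class PruningFifoDiskQueue(RemoveEmptyDirOnCloseMixin, queue.FifoDiskQueue):
#         pass
```

Note that this only acts at close time, so directories of queues that become empty during a long crawl would still stay around until the queue is closed.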
