[httpcache] FilesystemCacheStorage file organization

Situation
I use FilesystemCacheStorage and have cached about 1M requests, roughly 30GB of data (not sure, I’m still waiting for the count to finish …). The crawl can no longer proceed: it reports that there is no disk space left, even though I still have 10+GB free on disk.
Interestingly, I can’t even use shell auto-completion now:

-bash: cannot create temp file for here-document: No space left on device
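
A “No space left on device” error while free space remains can occur when the filesystem has run out of inodes rather than bytes. The following is a minimal sketch for comparing the two with Python’s os.statvfs (Unix only). The function name fs_usage and the path "." are illustrative and not part of the original report.

```python
import os


def fs_usage(path="."):
    """Return free bytes and inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        # Space available to unprivileged users, in bytes.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Total and available inodes (some filesystems report 0 for both).
        "total_inodes": st.f_files,
        "free_inodes": st.f_favail,
    }


if __name__ == "__main__":
    usage = fs_usage(".")
    print(usage)
    if usage["total_inodes"] and usage["free_inodes"] == 0:
        print("No free inodes: new files cannot be created even with free bytes.")
```

From the shell, df -i reports the same inode counts per filesystem.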

Comment
I don’t know if this is a valid use case: I use the file cache to save API responses so that I can reprocess them as needed (to lower the burden on the service).
I think it’s possible that the filesystem has run out of inodes. The current implementation seems to create files for each request. Maybe cache entries could be combined into the same file.
Or maybe the solution is to use a database?
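
On the database idea: Scrapy also ships a DBM-based cache backend, DbmCacheStorage, which keeps a spider’s cache entries in a database file instead of creating a directory of files per request. Below is a sketch of the relevant settings.py lines, assuming the HTTP cache is otherwise configured as in the setup described above. Whether this backend copes well with around 30GB of responses is not something this thread establishes.

```python
# settings.py (fragment)

# Enable the HTTP cache middleware.
HTTPCACHE_ENABLED = True

# Use the DBM backend instead of the default filesystem backend.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.DbmCacheStorage"

# DBM module used by the backend ("dbm" is Scrapy's default).
HTTPCACHE_DBM_MODULE = "dbm"
```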

Issue Analytics

Created 9 years ago
Comments: 11 (6 by maintainers)

Top GitHub Comments

Gallaecio commented, Aug 14, 2019

@larryxiao Is it OK to close this issue?

If you would like to keep it open as a feature request, it might be a good idea to be more explicit about the desired change that, once implemented, would allow this issue to be closed.

setop commented, Nov 9, 2018

Cache eviction might not be the responsibility of a running crawler.

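Anyway, outside of the crawler, a user can run a job to purge the cache based on various policies (oldest first, response cache policy, last access).

For example, for an “oldest first” policy, one can run:

find httpcache -mindepth 3 -maxdepth 3 -type d -mtime +30 -exec rm -r {} +

This removes cache entry directories last modified more than 30 days ago (GNU find takes a plain day count for -mtime, and -delete cannot remove non-empty directories, hence rm -r).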

A harder one would be a “least recently accessed first” policy, since access time is not stored by Scrapy’s caching module and is rarely kept up to date by filesystems (atime) for performance reasons.
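
The “oldest first” purge can also be sketched in Python, for cases where a dry run or logging is wanted. This is only an illustration under assumptions: the function name purge_old_entries and its parameters are hypothetical, and it assumes cache entries are directories exactly three levels below the cache root (spider name, two-character prefix, request fingerprint), matching the depth used in the find command.

```python
import shutil
import time
from pathlib import Path


def purge_old_entries(cache_root, max_age_days=30, dry_run=True):
    """Remove cache entry directories not modified for `max_age_days` days.

    Assumes the layout <cache_root>/<spider>/<prefix>/<fingerprint>/.
    Returns the list of directories that were (or would be) removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    # Entries sit exactly three levels below the cache root.
    for entry in Path(cache_root).glob("*/*/*"):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            if not dry_run:
                shutil.rmtree(entry)
            removed.append(entry)
    return removed


if __name__ == "__main__":
    # Dry run by default: only report what would be deleted.
    for path in purge_old_entries("httpcache", max_age_days=30):
        print("would remove", path)
```

Running it with dry_run=True first makes it possible to review the candidates before anything is deleted.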
