[httpcache] FilesystemCacheStorage file organization
- Situation
- I use FilesystemCacheStorage and have saved about 1M requests, roughly 30GB of data (not sure; I’m still waiting for the count to finish). The crawl can’t proceed: it reports no disk space, even though I still have 10+GB free on disk.
- Interestingly, I can’t even use shell auto-completion now:
-bash: cannot create temp file for here-document: No space left on device
- Comment
- I don’t know if this is a valid use case: I use the file cache to save the responses of API calls so they can be reprocessed as needed (to lower the burden on the service).
- I think it’s possible that the filesystem has run out of inodes. The current implementation seems to create files for each request. Maybe cache entries could be combined into the same file.
- Or maybe the solution is to use a database?
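One way to test the inode hypothesis is to compare free inodes with free bytes on the filesystem that holds the cache. The following is a minimal sketch, not part of Scrapy: the helper name inode_usage is hypothetical, it relies on os.statvfs (Unix only), and some filesystems report zero total inodes, which the code treats as "unknown".

```python
import os


def inode_usage(path="."):
    """Return inode and byte availability for the filesystem containing `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes on the filesystem
    free = st.f_ffree    # number of free inodes
    used = total - free
    # Some filesystems report 0 total inodes; a percentage is meaningless there.
    percent_used = 100.0 * used / total if total else None
    return {
        "total": total,
        "free": free,
        "used": used,
        "percent_used": percent_used,
        # Bytes available to unprivileged users, for comparison with inode usage.
        "bytes_available": st.f_bavail * st.f_frsize,
    }


if __name__ == "__main__":
    info = inode_usage(".")
    print(info)
```

If the free inode count is at or near zero while bytes_available is still large, the "No space left on device" error is consistent with inode exhaustion rather than a full disk. The same numbers can be read from the shell with df -i.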
Issue Analytics
- State:
- Created 9 years ago
- Comments:11 (6 by maintainers)
Top GitHub Comments
@larryxiao Is it OK to close this issue?
If you would like to keep it open as a feature request, it might be a good idea to be more explicit about the desired change that, once implemented, would allow this issue to be closed.
Cache eviction might not be the responsibility of a running crawler.
Anyway, outside of a crawler, a user can run a job to purge the cache based on various policies (oldest first, response cache policy, last access).
E.g., for an “oldest first” policy, one can run:
find httpcache -maxdepth 3 -mindepth 3 -type d -mtime +30d
with the -delete flag.
A harder one would be a “least recently accessed first” policy, since access time is not stored by Scrapy’s caching module and is rarely recorded by filesystems (atime) for performance reasons.
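The “oldest first” purge can also be sketched in Python, which makes it easy to do a dry run first and to remove entry directories that still contain files. This is an illustration only: the function names and the dry_run parameter are hypothetical, the age test only approximates find's -mtime rounding, and it assumes each cache entry is a directory exactly three levels below the cache root, as the find command above implies.

```python
import shutil
import time
from pathlib import Path


def find_stale_entries(cache_root, max_age_days, now=None):
    """Yield depth-3 directories under cache_root not modified for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    # "*/*/*" mirrors `-mindepth 3 -maxdepth 3` in the find command.
    for entry in Path(cache_root).glob("*/*/*"):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            yield entry


def purge_stale_entries(cache_root, max_age_days, dry_run=True, now=None):
    """Remove stale entry directories; with dry_run=True, only report them."""
    # Materialize the list first so deletion does not disturb the directory scan.
    stale = list(find_stale_entries(cache_root, max_age_days, now=now))
    if not dry_run:
        for entry in stale:
            shutil.rmtree(entry)  # removes the directory and the files inside it
    return stale
```

Calling purge_stale_entries("httpcache", 30) lists the candidates without deleting anything; passing dry_run=False performs the removal. Running it from a scheduled job keeps eviction outside the crawler, as suggested above.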