[httpcache] FilesystemCacheStorage file organization

See original GitHub issue
  • Situation
  • I use FilesystemCacheStorage and have saved about 1M requests, roughly 30GB of data (not sure, I’m still waiting for counting …). The crawl can no longer proceed because it reports that there is no disk space left, even though I still have 10+GB free on the disk.
  • Interestingly, I can’t even use shell auto-completion now:
-bash: cannot create temp file for here-document: No space left on device
  • Comment
  • I don’t know whether this is a valid use case: I use the file cache to save the responses of API calls so that I can reprocess them as needed (to lower the burden on the service).
  • I think the filesystem may have run out of inodes. The current implementation seems to create files for each request. Maybe cached entries could be combined into the same file.
  • Or maybe the solution is to use a database?
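
The inode hypothesis above can be checked directly. The following is a minimal sketch, not part of Scrapy: the function names `filesystem_headroom` and `count_entries` are made up for illustration, and `os.statvfs` is only available on Unix-like systems. It compares free inodes with free bytes on the filesystem that holds the cache, and counts how many filesystem entries (each one consuming an inode) live under a directory such as `httpcache`.

```python
import os


def filesystem_headroom(path="."):
    """Return (free_inodes, total_inodes, free_bytes) for the filesystem holding `path`.

    Some filesystems allocate inodes dynamically and may report 0 for the
    inode counters; treat those values as "not applicable".
    """
    st = os.statvfs(path)
    free_bytes = st.f_bavail * st.f_frsize  # blocks available to non-root users
    return st.f_favail, st.f_files, free_bytes


def count_entries(root):
    """Count files and directories below `root`; each entry uses one inode."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


if __name__ == "__main__":
    # "httpcache" is an assumed cache location; adjust to your HTTPCACHE_DIR.
    free_inodes, total_inodes, free_bytes = filesystem_headroom(".")
    print(f"free inodes: {free_inodes} of {total_inodes}")
    print(f"free space:  {free_bytes / 1e9:.1f} GB")
    if os.path.isdir("httpcache"):
        print(f"cache entries: {count_entries('httpcache')}")
```

If free inodes are at or near zero while free space is still in the gigabytes, the "No space left on device" error is explained by inode exhaustion rather than by a full disk. In that case, one option that matches the "use a database" idea is Scrapy's DBM backend, selected with `HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'` in `settings.py`; whether it suits a cache of this size is something to verify against the Scrapy documentation.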

Issue Analytics

  • State: closed
  • Created 9 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

Gallaecio commented, Aug 14, 2019

@larryxiao Is it OK to close this issue?

If you would like to keep it open as a feature request, it might be a good idea to be more explicit about the desired change that, once implemented, would allow this issue to be closed.

setop commented, Nov 9, 2018

Cache eviction might not be the responsibility of a running crawler.

Anyway, outside of a crawler, a user can run a job to purge the cache based on various policies (oldest first, response cache policy, last access).

E.g., for an “oldest first” policy, one can run:

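find httpcache -mindepth 3 -maxdepth 3 -type d -mtime +30 -exec rm -r {} + (GNU find rejects the d suffix in +30d, and the -delete flag cannot remove non-empty directories, hence -exec rm -r).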

A harder one would be a “least recently accessed first” policy, as access time is not stored by Scrapy’s caching module and is rarely recorded by filesystems (atime) for performance reasons.
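
The “oldest first” purge described in this comment can also be sketched in Python. This is an illustrative script, not something Scrapy provides: the function name `purge_older_than` and its parameters are hypothetical, and it assumes the same layout the `find` command relies on, namely that per-request cache directories sit three levels below the cache root. It defaults to a dry run so that nothing is deleted until the candidate list has been reviewed.

```python
import os
import shutil
import time


def purge_older_than(cache_dir, max_age_days, dry_run=True):
    """Return (and optionally delete) depth-3 directories older than `max_age_days`.

    Assumed layout: cache_dir/<level 1>/<level 2>/<per-request directory>.
    Age is judged by the directory's modification time.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for level1 in os.scandir(cache_dir):
        if not level1.is_dir(follow_symlinks=False):
            continue
        for level2 in os.scandir(level1.path):
            if not level2.is_dir(follow_symlinks=False):
                continue
            for entry in os.scandir(level2.path):
                if not entry.is_dir(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                    stale.append(entry.path)
    # Delete only after the scan has finished, so no directory is modified
    # while it is still being iterated.
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale


if __name__ == "__main__":
    # "httpcache" is an assumed cache location; adjust to your HTTPCACHE_DIR.
    if os.path.isdir("httpcache"):
        for path in purge_older_than("httpcache", max_age_days=30):
            print("would remove", path)
```

The cutoff here is a plain "older than N × 24 hours" comparison, which is close to, but not identical with, how `find -mtime +30` rounds ages to whole days. Running such a script from cron or a similar scheduler keeps eviction outside the crawler, as the comment suggests.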


Top Results From Across the Web

How to best handle Scrapy cache at 'OSError: [Errno 28] No ...
If your cache data is very large, you should consider to use DB to store instead of local file system. Or you can...

Scrapy - Other Settings - Tutorialspoint
It is a class implementing the cache storage. Default value: 'scrapy.extensions.httpcache.FilesystemCacheStorage'. 32. HTTPERROR_ALLOWED_CODES.

Downloader Middleware — Scrapy 2.7.1 documentation
It's a light, low-level system for globally altering Scrapy's requests and ... File system storage backend is available for the HTTP cache middleware....

Website Scraping with Python - Ciência de Dados - 29 - Passei Direto
File System Storage If you enable HTTP caching, this is the default solution ... to your settings.py file: HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.

How to identify Cache location and delete it ? - Google Groups
The default HTTP cache storage is on the filesystem: HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'.
