How to clear all data that comes from a website and reset the spider?
See original GitHub issue

I am crawling 100 URLs with Scrapy. I have found that if I start Scrapy,
- crawl 1 URL,
- kill Scrapy,

then I can crawl every URL in a short time. However, if I crawl all 100 URLs in the same Scrapy process, the website detects me and refuses my requests.

Therefore, the website must be writing some data to my spider that makes later requests fail. Is there a method to reset the status of the spider, just like restarting the Scrapy process, so that I can reset the spider's state as soon as I finish crawling each URL?
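The kill-and-restart workaround described above can be scripted so that each URL gets a brand-new process. A minimal sketch, assuming a spider named `myspider` that accepts a `url` argument via `-a` (both names are hypothetical):

```python
import subprocess

# Illustrative URL list; replace with your own 100 URLs.
URLS = ["https://example.com/page1", "https://example.com/page2"]

def crawl_command(spider: str, url: str) -> list[str]:
    """Build the scrapy CLI invocation for one isolated crawl.
    Assumes the spider reads a `url` argument passed with -a."""
    return ["scrapy", "crawl", spider, "-a", f"url={url}"]

if __name__ == "__main__":
    for url in URLS:
        # Each run is a brand-new OS process, so nothing (cookies,
        # in-memory state) carries over between URLs.
        subprocess.run(crawl_command("myspider", url), check=True)
```

This trades throughput for isolation: every crawl pays the full Scrapy startup cost, which is why the discussion below focuses on removing the state instead.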
Issue Analytics
- Created 6 years ago
- Comments: 6 (2 by maintainers)
Top GitHub Comments
@kingname, no need to be sorry. We are always learning and adapting ourselves to each community.
Back to your problem: I think your understanding of the request/response cycle, and of Scrapy itself, is mistaken. The web was built to be stateless. Most websites work over HTTP, and HTTP is stateless; so is HTTP/2. As a workaround, websites use cookies to maintain some kind of state between requests. So if you disable cookies in your spider, there is absolutely no state between your requests. How, then, could you "clear all data which come from website and reset a spider" if there is no such data in the first place?
There is no way to implement the feature you request, simply because there is no state to clear.
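If cookies are the only state the server can plant, disabling them removes it entirely. In a Scrapy project that is a one-line setting; a sketch of the relevant `settings.py` fragment:

```python
# settings.py (project-wide): stop Scrapy's cookies middleware from
# storing or sending any cookies, so no server-set state survives
# between requests.
COOKIES_ENABLED = False
```

The same key can also be set per spider through the spider's `custom_settings` dict if you only want one spider to run cookie-free.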
But you are being blocked even though the web is stateless… There are numerous techniques to analyze request logs and identify individual users. In every request you may (and usually do!) send a lot of information to the server in the form of request headers. If you are changing your UA and your IP on every request, the server is probably identifying you by other information; there are thousands of ways to do it. (Did you disable cookies!?)
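Rotating identifying headers per request is one common mitigation alongside changing UA and IP. A minimal sketch in plain Python (the User-Agent strings and the helper name are illustrative, not from the original discussion); the resulting dict can be passed as the `headers=` argument of `scrapy.Request`:

```python
import random

# Illustrative pool; a real project would maintain a larger,
# up-to-date list of User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def fresh_headers() -> dict:
    """Build per-request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

As the comment above notes, this only addresses one of many signals a server can fingerprint; it is not a guarantee against blocking.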
The problem you have at hand has nothing to do with Scrapy itself. Scrapy is just a tool you use to make requests, in the same way a browser is just a tool you use to make requests (and see content). A server/website can block you regardless of whether you are using Scrapy or a browser.
Could Scrapy itself have a bug that is interfering with you? Yes. But if so, you first have to identify that bug so someone can fix it.
Again, you have a better chance of solving your problem on Stack Overflow, or by researching how servers identify/block users and how to avoid it.
Thanks for your reply.