How to clear all data which come from website and reset spider?

I am crawling 100 URLs with Scrapy. I have found that if I start Scrapy, crawl one URL, and then kill the process, I can get through every URL in a short time. However, if I crawl all 100 URLs in the same Scrapy process, the website detects me and refuses my requests.

So the website must be writing some data to my spider that causes the later requests to fail. Is there a way to reset the spider's state, as if the Scrapy process had been restarted, so that I can reset it each time I finish crawling a URL?

Issue Analytics

Created 6 years ago
Comments: 6 (2 by maintainers)

Top GitHub Comments

djunzu commented, Oct 27, 2017

@kingname, no need to be sorry. We are always learning something and adapting ourselves to each community.

Back to your problem: I think your understanding of the request/response cycle, and of Scrapy itself, is mistaken. The web was built to be stateless. Most websites work over HTTP, and HTTP is stateless; HTTP/2 is stateless too. As a workaround, websites use cookies to maintain some kind of state between requests. So if you disable cookies in your spider, there is absolutely no state carried between your requests. How, then, would it be possible to clear all data that comes from the website and reset the spider if there is no such data in the first place?
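For reference, disabling cookies in Scrapy is a one-line setting. This is a minimal sketch of a settings.py fragment, assuming a standard Scrapy project layout; it is not taken from the original poster's project.

```python
# settings.py fragment (sketch): turn off Scrapy's cookie handling so that
# no cookie set by the website is sent back on later requests.
# The same key can also be placed in a spider's custom_settings dict.
COOKIES_ENABLED = False
```

With this in place, anything the site tries to store through a Set-Cookie header is dropped, so consecutive requests from the same process carry no cookie state.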

There is no way to implement the feature you request simply because there is no state to clear.

But you are being blocked even though the web is stateless… There are numerous techniques for analyzing request logs and identifying individual users. In every request you may (and usually do!) send a lot of information to the server in the form of request headers. If you are changing your UA and your IP on every request, the server is probably identifying you using other information. There are thousands of ways to do it. (Did you disable cookies?)
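To make the point about headers concrete, here is a toy sketch of one way a server could group requests. The `header_fingerprint` function and the sample header values are hypothetical, made up for illustration; real sites may use entirely different signals.

```python
import hashlib


def header_fingerprint(headers, ignore=("user-agent",)):
    """Hash header names and values in the order sent, skipping ignored names.

    Hypothetical illustration only, not any real server's method.
    """
    parts = [
        f"{name.lower()}:{value}"
        for name, value in headers
        if name.lower() not in ignore
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


# Two requests that differ only in the User-Agent header (made-up values).
first = [
    ("User-Agent", "agent-a"),
    ("Accept", "text/html"),
    ("Accept-Language", "en"),
    ("Accept-Encoding", "gzip, deflate"),
]
second = [("User-Agent", "agent-b")] + first[1:]

# Rotating the UA alone leaves the rest of the header set identical.
print(header_fingerprint(first) == header_fingerprint(second))  # True
```

If everything except the UA stays the same from request to request, a check like this would still treat them as one client, which is one reason rotating the UA and IP may not be enough.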

The problem at hand has nothing to do with Scrapy itself. Scrapy is just a tool you use to make requests, in the same way a browser is just a tool you use to make requests (and view content). A server can block you regardless of whether you are using Scrapy or a browser.

Could Scrapy itself have a bug that is affecting you? Yes. But if so, you first have to identify the bug so that someone can fix it.

But again, you have a better chance of solving your problem by going to SO, or by searching for how servers identify and block users and how to avoid it.

kingname commented, Oct 30, 2017

Thanks for your reply.