How to clear all data that comes from a website and reset the spider?

I am crawling 100 URLs with Scrapy. I have found that if I start Scrapy, crawl one URL, and then kill the process, I can get through every URL in a short time. However, if I crawl all 100 URLs in the same Scrapy process, the website detects me and refuses my requests.

Therefore, the website must be writing some data to my spider that causes later requests to fail. Is there a way to reset the spider's state, as if the Scrapy process had been restarted, so that I can reset it as soon as I finish crawling each URL?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
djunzu commented, Oct 27, 2017

@kingname, no need to be sorry. We are always learning something and adapting ourselves to each community.

Back to your problem: I think your understanding of the request/response cycle, and of Scrapy itself, is mistaken. The web was built to be stateless. Most websites run over HTTP, and HTTP is stateless; so is HTTP/2. As a workaround, websites use cookies to maintain some kind of state between requests. So if you disable cookies in your spider, there is no state carried between your requests. How, then, could you clear all the data that comes from the website and reset the spider if there is no such data in the first place?

There is no way to implement the feature you request simply because there is no state to clear.
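For reference, a minimal sketch of what "disable cookies in your spider" looks like in practice, assuming a standard Scrapy project with a `settings.py` file. `COOKIES_ENABLED` is a built-in Scrapy setting that defaults to `True`; the rest of the project layout is assumed here.

```python
# settings.py (fragment of an assumed Scrapy project)

# Turn off Scrapy's cookies middleware so that no cookies received from
# the website are stored or sent back on later requests.
COOKIES_ENABLED = False
```

With this in place, cookies set by one response are not replayed on the next request, so cookies can be ruled out as the way the site is recognizing the spider.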

But you are being blocked even though the web is stateless… There are numerous techniques for analyzing request logs and identifying individual users. With every request you may (and usually do!) send a lot of information to the server in the request headers. If you are changing your UA and your IP on every request, the server is probably identifying you from other information. There are thousands of ways to do it. (Did you disable cookies!?)
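As a toy illustration of that point, the sketch below hashes every request header except `User-Agent`. The function name `header_fingerprint` and the sample header values are hypothetical, and this is not a claim about how any particular website detects crawlers; it only shows that rotating the UA alone leaves the rest of the request looking identical.

```python
import hashlib


def header_fingerprint(headers, ignore=("user-agent",)):
    """Hash header names and values, skipping the names listed in `ignore`."""
    kept = sorted(
        (name.lower(), value)
        for name, value in headers.items()
        if name.lower() not in ignore
    )
    return hashlib.sha256(repr(kept).encode("utf-8")).hexdigest()


# Two requests that differ only in their User-Agent (made-up values).
first = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "text/html",
    "Accept-Language": "en",
}
second = dict(first, **{"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"})

# True: apart from the UA, the two requests are indistinguishable.
same_client = header_fingerprint(first) == header_fingerprint(second)
```

A server that groups requests by something like this would still see one client, whatever the UA says.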

The problem you have on your hands has nothing to do with Scrapy itself. Scrapy is just a tool for making requests, in the same way a browser is just a tool for making requests (and viewing content). A server/website can block you regardless of whether you are using Scrapy or a browser.

Scrapy itself may have a bug that is messing with you? Yes. But if so, you have to first identify this bug so someone can fix it.

But again, you have a better chance of solving your problem by going to SO, or by searching for how servers identify and block users and how to avoid it.

0 reactions
kingname commented, Oct 30, 2017

Thanks for your reply.
