Order of calling close_spider in pipelines
process_item is called based on the order of the pipeline classes listed in the ITEM_PIPELINES setting. But close_spider follows the exact opposite order, with the close_spider of the last pipeline getting called first [link]. I can’t think of a good reason for this. Should I submit a PR?
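The reported behavior can be reproduced without Scrapy. Below is a minimal stand-alone simulation of the ordering the issue describes; PipelineManager here is a hypothetical stand-in for Scrapy's middleware manager, not its actual implementation — it simply runs process_item in ITEM_PIPELINES order and close_spider in reverse, as reported.

```python
# Simulation of the ordering described in the issue.
# PipelineManager is a hypothetical stand-in, not Scrapy's real code.

calls = []

class PipelineA:
    def process_item(self, item, spider):
        calls.append("A.process_item")
        return item
    def close_spider(self, spider):
        calls.append("A.close_spider")

class PipelineB:
    def process_item(self, item, spider):
        calls.append("B.process_item")
        return item
    def close_spider(self, spider):
        calls.append("B.close_spider")

class PipelineManager:
    def __init__(self, pipelines):
        # pipelines are held in ITEM_PIPELINES priority order
        self.pipelines = pipelines

    def process_item(self, item, spider):
        for p in self.pipelines:            # forward order
            item = p.process_item(item, spider)
        return item

    def close_spider(self, spider):
        for p in reversed(self.pipelines):  # reverse order, as reported
            p.close_spider(spider)

manager = PipelineManager([PipelineA(), PipelineB()])
manager.process_item({"url": "http://example.com"}, spider=None)
manager.close_spider(spider=None)
print(calls)
# ['A.process_item', 'B.process_item', 'B.close_spider', 'A.close_spider']
```

Items flow A → B, but teardown runs B → A — exactly the asymmetry the issue asks about.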
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 1
- Comments: 7 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I made a bad choice in wording. For my use-case, *pipeline A* posts items to an ES in batches, hence it needs to hold the items till it has enough of them to make a batch request. Once all the items have been batch-posted by *pipeline A*, *pipeline B* does some stuff based on the ES indices affected by *pipeline A*'s actions.

I totally agree with you about the backward-compatibility point. I was thinking of some kind of setting that would allow the user to choose between having close_spider called in reverse order (as it is now) or having it called in the same order as that of the pipelines – a simple boolean setting would suffice for this. I would love to hear your thoughts and ideas about this.

@kmike Sure, I totally see the benefits in the open_spider and process_item cases. I am just not sure what benefit we get from calling B.close_spider before A.close_spider. Isn’t it more intuitive for all the functions to be serial? For example, a call order like,

I stumbled upon this in a usage where my *pipeline A* posts to ES and *pipeline B* then works on the indices touched by *pipeline A*. Say *pipeline A* stores a buffer of items that is completely flushed on close_spider; then if *pipeline B*'s close_spider is called before *pipeline A*'s close_spider, the datastore has not yet been updated with all the data, and it appears like the overall spider pipeline is behaving like a bi-directional valve. Given that an ordering for pipelines exists, I think there should be a way to support both orderings at the least.

Can you mention a use-case where you would have to have pipelines structured such that B would require being closed before A?
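The failure mode described in the comments can be sketched the same way. Below, a buffering pipeline A flushes its last partial batch only in close_spider, while pipeline B's close_spider reads the store; under the reverse close order, B sees an incomplete datastore. All names here (BatchPostPipeline, IndexReportPipeline, the in-memory dict standing in for Elasticsearch) are illustrative, not Scrapy or Elasticsearch APIs.

```python
# Sketch of the race described above; a dict stands in for Elasticsearch.
# All class names are illustrative, not real Scrapy/ES APIs.

store = {}  # fake datastore: index name -> list of items

class BatchPostPipeline:          # "pipeline A"
    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def _flush(self):
        store.setdefault("items", []).extend(self.buffer)
        self.buffer = []

    def close_spider(self, spider):
        self._flush()             # the last partial batch lands only here

class IndexReportPipeline:        # "pipeline B"
    def process_item(self, item, spider):
        return item

    def close_spider(self, spider):
        # Under the current reverse close order this runs *before* A's
        # flush, so it observes an incomplete datastore.
        self.seen = len(store.get("items", []))

a, b = BatchPostPipeline(batch_size=10), IndexReportPipeline()
for i in range(5):                # fewer items than one full batch
    b.process_item(a.process_item({"id": i}, None), None)

b.close_spider(None)              # reverse order: B closes first ...
a.close_spider(None)              # ... then A flushes its buffer

print(b.seen, len(store["items"]))
# 0 5  -- B saw nothing, even though 5 items end up in the store
```

Closing in pipeline order (A first, then B) would let B observe all 5 items, which is the ordering the commenter is asking to support.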