Order of calling close_spider in pipelines

See original GitHub issue

process_item is called in the order of the pipeline classes listed in the ITEM_PIPELINES setting. But close_spider follows the exact opposite order: the close_spider of the last pipeline is called first [link].
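To make the reported ordering concrete, here is a minimal pure-Python sketch. It does not import Scrapy or reproduce its internals; `RecordingPipeline` and `run_pipelines` are hypothetical stand-ins that simply replay the behaviour described above, assuming two pipelines A and B with priorities 100 and 200.

```python
class RecordingPipeline:
    """Hypothetical stand-in for an item pipeline; records every hook call."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def open_spider(self, spider):
        self.log.append(f"{self.name}.open_spider")

    def process_item(self, item, spider):
        self.log.append(f"{self.name}.process_item")
        return item

    def close_spider(self, spider):
        self.log.append(f"{self.name}.close_spider")


def run_pipelines(priorities, items):
    """Replay the call order described in the issue (not Scrapy's real code)."""
    log = []
    # Lower priority values run first, as with ITEM_PIPELINES.
    names = sorted(priorities, key=priorities.get)
    pipelines = [RecordingPipeline(name, log) for name in names]

    for pipeline in pipelines:
        pipeline.open_spider(None)
    for item in items:
        for pipeline in pipelines:
            item = pipeline.process_item(item, None)
    # The behaviour under discussion: close hooks run in reverse order.
    for pipeline in reversed(pipelines):
        pipeline.close_spider(None)
    return log


call_log = run_pipelines({"A": 100, "B": 200}, [{"id": 1}])
for entry in call_log:
    print(entry)
```

In this sketch B.close_spider is logged before A.close_spider, even though A runs first for every other hook.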

I can’t think of a good reason for this. Should I submit a PR?

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

4 reactions
debosmit commented, Mar 16, 2017

I made a bad choice in wording. For my use case, *pipeline A* posts items to ES in batches, so it needs to hold the items until it has enough of them to make a batch request. Once all the items have been batch-posted by *pipeline A*, *pipeline B* does some work based on the ES indices affected by *pipeline A*'s actions.

I totally agree with you about the backward compatibility point. I was thinking of a setting that lets the user choose between having close_spider called in reverse order (as it is now) and having it called in the same order as the pipelines; a simple boolean setting would suffice for this. I would love to hear your thoughts and ideas about this.
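As a rough illustration of the boolean toggle proposed here, the sketch below picks the close order from a flag. The function name `close_order` and the parameter `reverse_close` are hypothetical; no such setting is known to exist in Scrapy, and this is not how Scrapy builds its pipeline manager.

```python
def close_order(pipeline_names, reverse_close=True):
    """Return the order in which close_spider hooks would run.

    reverse_close=True mirrors the current behaviour described in the issue;
    reverse_close=False is the proposed same-as-pipelines order.
    Both the function and the flag are hypothetical.
    """
    if reverse_close:
        return list(reversed(pipeline_names))
    return list(pipeline_names)


print(close_order(["A", "B"]))                       # current behaviour
print(close_order(["A", "B"], reverse_close=False))  # proposed alternative
```

Defaulting the flag to the existing reverse order is one way to keep the backward compatibility concern addressed, since projects that never set it would see no change.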

2 reactions
debosmit commented, Mar 13, 2017

@kmike Sure, I totally see the benefits in the open_spider and process_item cases. I am just not sure what benefit we get from calling B.close_spider before A.close_spider. Isn’t it more intuitive for all the functions to be called in the same order? For example, a call order like:

1. A.open_spider
2. B.open_spider
3. A.process_item
4. B.process_item
5. A.close_spider
6. B.close_spider

I stumbled upon this in a use case where *pipeline A* posts to ES and *pipeline B* then works on the indices touched by *pipeline A*. Say *pipeline A* keeps a buffer of items that is only completely flushed in close_spider. If *pipeline B*'s close_spider is called before *pipeline A*'s close_spider, the datastore has not yet been updated with all the data, and the overall spider pipeline appears to behave like a bi-directional valve. Given that an ordering for pipelines exists, I think there should at least be a way to support both orderings.
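The buffering problem can be reproduced with a small self-contained simulation. This is only a sketch under stated assumptions: a plain Python list stands in for the ES datastore, and `BufferingPipeline`, `IndexReaderPipeline`, and `simulate` are hypothetical names, not Scrapy or Elasticsearch APIs.

```python
class BufferingPipeline:
    """Hypothetical pipeline A: writes to the store in batches."""

    def __init__(self, store, batch_size):
        self.store = store
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider=None):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        # Stand-in for a batch request to the datastore.
        self.store.extend(self.buffer)
        self.buffer.clear()

    def close_spider(self, spider=None):
        # Any partial batch is only written when the spider closes.
        self.flush()


class IndexReaderPipeline:
    """Hypothetical pipeline B: inspects the store when the spider closes."""

    def __init__(self, store):
        self.store = store
        self.seen_at_close = None

    def process_item(self, item, spider=None):
        return item

    def close_spider(self, spider=None):
        self.seen_at_close = len(self.store)


def simulate(close_in_reverse, n_items=5, batch_size=2):
    """Return how many items B sees in the store during its close_spider."""
    store = []
    a = BufferingPipeline(store, batch_size)
    b = IndexReaderPipeline(store)
    for i in range(n_items):
        b.process_item(a.process_item({"id": i}))
    closers = [b, a] if close_in_reverse else [a, b]
    for pipeline in closers:
        pipeline.close_spider()
    return b.seen_at_close


print(simulate(close_in_reverse=True))   # B closes before A has flushed
print(simulate(close_in_reverse=False))  # A flushes, then B closes
```

With five items and a batch size of two, one item is still sitting in A's buffer when B closes first, so B sees an incomplete store.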

Can you mention a use-case where you would have to have pipelines structured such that B would require being closed before A?
