
Allow multiple items through pipelines?

See original GitHub issue

The documentation for item pipelines specifies that process_item must either return a dict with data, return an Item object, or raise a DropItem exception. Is there a reason why we aren’t allowed to return an iterable of dicts (or Item objects)? Under the current framework it seems impossible to write a pipeline that takes one input item and emits multiple items.

Thank you!
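
For context, the contract described above looks like this in practice (a minimal sketch; the pipeline name and the price check are made-up illustrations, but process_item(self, item, spider) and DropItem are the real Scrapy interface):

from scrapy.exceptions import DropItem

class ValidatePricePipeline:
    # Current contract: process_item must return exactly one item
    # (a dict or an Item) or raise DropItem - there is no supported
    # way to return several items from one input item.
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price")
        item["price"] = round(item["price"], 2)
        return item  # one item out; returning a list is not supported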

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 3
  • Comments: 24 (10 by maintainers)

Top GitHub Comments

7 reactions
kmike commented, May 16, 2016
  1. A pipeline has to be able to return either a single object or an iterable collection of them. It also has to accept both; otherwise it wouldn’t make sense.

For me it makes more sense for a pipeline to accept a single item, but return either a single item or a list/iterable. Before:

item1 --> [pipeline A] --> item1 --> [pipeline B] --> ...
item2 --> [pipeline A] --x raise DropItem()

New feature:

item3 --> [pipeline A],--> item4 --> [pipeline B] --> ...
                      '--> item5 --> [pipeline B] --> ...
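
Scrapy’s built-in ItemPipelineManager does not implement this, but the fan-out kmike sketches could work roughly as follows (a hypothetical sketch; run_pipelines and the type checks are illustrative, not Scrapy API):

import types

from scrapy.exceptions import DropItem

def run_pipelines(item, spider, pipelines):
    # Feed one item in; let each pipeline return one item or many.
    items = [item]
    for pipeline in pipelines:
        next_items = []
        for current in items:
            try:
                result = pipeline.process_item(current, spider)
            except DropItem:
                continue  # the item2 case in the diagram above
            # Treat lists, tuples and generators as fan-out; anything
            # else (a dict or an Item) passes through as a single item.
            if isinstance(result, (list, tuple, types.GeneratorType)):
                next_items.extend(result)  # the item3 -> item4/item5 case
            else:
                next_items.append(result)
        items = next_items
    return items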

3 reactions
dxue2012 commented, May 10, 2016

Hi all,

I ended up defining a custom post-processing step after ItemPipelines in the following manner:

  1. Store all the items in MongoDB at the end of Pipelines
  2. Read all items from MongoDB, and do more stuff on the data with “Processors”

Each processor (analogous to a pipeline) defines a function called process_iter_items, which takes an iterable of dicts and must return an iterable of dicts. The set of processors is managed by BatchProcessorManager, a MiddlewareManager subclass similar to ItemPipelineManager, which chains process_iter_items functions instead of process_item functions.

The chain of process_iter_items is connected to the signal emitted by the last ItemPipeline that stores the items in MongoDB.
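
A minimal sketch of that design (BatchProcessorManager and process_iter_items are the names from the comment; the example processors and the plain-Python chaining are illustrative, since the original builds on Scrapy’s MiddlewareManager):

class DedupeProcessor:
    # Example processor: N items in, at most N items out.
    def process_iter_items(self, items):
        seen = set()
        for item in items:
            if item["url"] not in seen:
                seen.add(item["url"])
                yield item

class ExplodeVariantsProcessor:
    # Example processor: one item in, several items out - exactly the
    # case a plain process_item pipeline cannot express.
    def process_iter_items(self, items):
        for item in items:
            for variant in item.get("variants", [None]):
                yield {**item, "variant": variant}

class BatchProcessorManager:
    # Chains process_iter_items calls, analogous to how Scrapy's
    # ItemPipelineManager chains process_item calls.
    def __init__(self, *processors):
        self.processors = processors

    def process(self, items):
        for processor in self.processors:
            items = processor.process_iter_items(items)
        return items

Items read back from storage would be fed in as the initial iterable, e.g. list(manager.process(collection.find())) with a pymongo collection.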

Read more comments on GitHub >

Top Results From Across the Web

Scrapy, Python: Multiple Item Classes in one pipeline?
You can have one pipeline handle only one type of item request, though, if handling that item type is unique, by checking the...

Check out multiple repositories in your pipeline - Microsoft Learn
Pipelines often rely on multiple repositories that contain source, tools, scripts, or other items that you need to build your code.

Item Pipeline — Scrapy 2.7.1 documentation
Write items to a JSON lines file. The following pipeline stores all scraped items (from all spiders) into a single items.jsonl file, containing...

Downstream pipelines - GitLab Docs
A pipeline in one project can trigger downstream pipelines in another project, called multi-project pipelines. The user triggering the upstream pipeline must be...

Item Pipeline - Scrapy documentation - Read the Docs
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that...
