Refactor pipeline's design
Current Status and Issues
- We distinguish between query pipelines and indexing pipelines
  - Query pipelines:
    - Use `"Query"` as root node
    - Output: `List[Document]` or `List[Answer]`
    - Access the `DocumentStore` through a `Retriever`
  - Indexing pipelines:
    - Use `"File"` as root node
    - Don't have an output, but index `Document`s to a `DocumentStore`
    - Currently not very prominent:
      - not really used in Tutorials
      - the `convert_file_to_docs` utility function currently takes the role of indexing pipelines
    - Access the `DocumentStore` explicitly
- You can theoretically have more than one `DocumentStore` per Pipeline, but right now this possibility is not contemplated
- `Pipeline`'s `run` method takes basically anything: `query`, `documents`, `file_paths`, `labels`, …
- Next to the generic `Pipeline`, we have a collection of "standard pipelines"
  - Currently, these are only query pipelines
  - Purpose: easing things for people who aren't aware of how Pipelines should look for specific tasks, like a "recipe"
  - Syntactic sugar: they don't add anything over a plain Pipeline made of the same nodes.
- Due to the `"Query"` and `"File"` root nodes, Pipelines can only be query or indexing pipelines: other approaches are really hard to implement
  - For example, it is quite complicated to build a summarization pipeline
Core ideas
- Clarify the status of `DocumentStore` in Pipelines
  - Decision point: are DocumentStores nodes or not?
    - If they are, they need to behave like normal nodes
    - If they are not, we need to clarify their API with respect to Pipelines
  - Direct access to DocumentStores by nodes should be made harder to avoid the DocumentStore-Retriever coupling issue. Pipelines should mediate it.
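One possible shape for that mediation, sketched as a hypothetical design (all class and parameter names here are illustrative, not an agreed API): the pipeline owns the DocumentStore and injects it at run time, so nodes never hold a direct reference.

```python
# Hypothetical sketch: the Pipeline, not the node, owns the DocumentStore.
# Nodes receive the store only as a run-time argument.

class InMemoryStore:
    def __init__(self):
        self.docs = []
    def write(self, docs):
        self.docs.extend(docs)
    def filter(self, keyword):
        return [d for d in self.docs if keyword in d]

class KeywordRetriever:
    # Note: no __init__(document_store=...) — the node holds no store reference.
    def run(self, query, document_store):
        return {"documents": document_store.filter(query)}

class MediatedPipeline:
    def __init__(self, document_store, nodes):
        self.document_store = document_store
        self.nodes = nodes

    def run(self, query):
        data = {"query": query}
        for node in self.nodes:
            # The pipeline injects the store, mediating all access to it.
            data.update(node.run(query=data["query"],
                                 document_store=self.document_store))
        return data

store = InMemoryStore()
store.write(["haystack pipelines", "needle in haystack", "other text"])
pipe = MediatedPipeline(store, [KeywordRetriever()])
result = pipe.run("haystack")
```

With this shape, swapping the DocumentStore is a pipeline-level change and nodes stay serializable, which matters for the distributed-execution ideas discussed further down.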
- Pipelines should always produce an output
  - Currently indexing pipelines do not produce an output, while query pipelines do
  - In the case of indexing pipelines, the output could be the list of `Document` objects that were indexed (to be clarified)
- Get rid of the `convert_file_to_docs` utility function
  - It is used in a lot of tutorials and should be replaced by indexing pipelines
- Get rid of the `"File"` and `"Query"` root nodes
  - We should stop classifying pipelines as indexing or query and allow a more flexible structure
  - This would result in a generic `Pipeline` that works easily with any use case
  - Additionally, we might have the subclasses `QueryPipeline(Pipeline)` and `IndexingPipeline(Pipeline)`, which would help users understand what's required for querying and for indexing respectively
- Nodes should declare their input
  - Right now nodes use the `run` method's signature to declare their input. Mypy doesn't like that, for a good reason
  - Nodes should be free to take the input they need, either with `**kwargs` or with an object/dictionary
  - They need to declare a list of expected parameters to allow validation
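One way such a declaration could look, as a minimal sketch (the attribute name `expected_inputs` and the `BaseComponent` API here are hypothetical, not the proposal itself): nodes list their required parameters in a class attribute, and validation happens before the node-specific logic runs.

```python
# Sketch of input declaration via class attributes (hypothetical API):
# each node lists its expected parameters so a pipeline can validate
# connections up front instead of inspecting run()'s signature.

class BaseComponent:
    expected_inputs: tuple = ()

    def run(self, **kwargs):
        missing = [p for p in self.expected_inputs if p not in kwargs]
        if missing:
            raise ValueError(f"{type(self).__name__} is missing inputs: {missing}")
        return self._run(**kwargs)

    def _run(self, **kwargs):
        raise NotImplementedError

class Reader(BaseComponent):
    expected_inputs = ("query", "documents")

    def _run(self, query, documents, **kwargs):
        return {"answers": [f"{query} -> {documents[0]}"]}

reader = Reader()
result = reader.run(query="q", documents=["doc1"])
```

Because `expected_inputs` is plain class-level data, a pipeline (or Mypy-friendly tooling) can check node wiring without ever calling `run`.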
- The distinction between `run` and `run_batch` is unnecessary
  - There should be only `run_batch`, as the current `run` is simply a corner case where `batch_size=1`
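The "corner case" framing can be sketched in a few lines (illustrative only, not the actual Haystack API): the batched method is the one real implementation, and a single-item `run` is just sugar over it.

```python
# Sketch: a single batched entry point, with run() kept only as sugar
# for a batch of size one (illustrative, not the actual Haystack API).
from typing import List

class Upcaser:
    def run_batch(self, queries: List[str]) -> List[str]:
        # The one real implementation: always operates on a batch.
        return [q.upper() for q in queries]

    def run(self, query: str) -> str:
        # A single item is just a batch of one.
        return self.run_batch([query])[0]

node = Upcaser()
```

Node authors then implement exactly one method, and the single-item path can never drift out of sync with the batched one.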
- Marginally related:
  - https://www.notion.so/deepsetai/Ray-graph-api-49b8d35a4cd544caaa7ae396790b2926
  - https://github.com/deepset-ai/haystack/issues/2403 (separate the concepts of "Retriever" and "Embedder")
Action Plan
- Create new `Pipeline` and `BaseComponent` objects with basic features
  - Aims:
    - Remove the `Query` and `Indexing` root nodes
    - Refactor the `run` method to always take generic, batched input
    - Implement input validation through class attributes
    - Tackle dynamic outgoing edges management (#2850)
    - Add support for circular pipeline graphs?
    - Add support for `async`? See https://github.com/deepset-ai/haystack/issues/2968
    - Add support for nodes to output to several nodes at once: see https://github.com/deepset-ai/haystack/discussions/2972
- Migrate split/join nodes
- Ensure loading from YAML and validation before loading are still possible
  - Naming the content of YAML files is a bonus (a pipeline "schema"/"group"/"blueprint"? A "configuration"? An "architecture"? A "manifest"?)
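For reference, a minimal sketch of the thing being named, loosely following the current YAML layout (component names and exact keys here are illustrative; the real schema is defined by Haystack's pipeline validation):

```yaml
# Illustrative pipeline "manifest" (naming to be decided): components are
# declared once, then wired into one or more pipelines via their inputs.
version: "1.0"
components:
  - name: MyDocumentStore
    type: InMemoryDocumentStore   # illustrative component type
  - name: MyRetriever
    type: BM25Retriever           # illustrative component type
    params:
      document_store: MyDocumentStore
pipelines:
  - name: query
    nodes:
      - name: MyRetriever
        inputs: [Query]           # the "Query" root node this refactor removes
```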
- Implement a new `RayPipeline`, taking https://www.notion.so/deepsetai/Ray-graph-api-49b8d35a4cd544caaa7ae396790b2926 and https://github.com/ArzelaAscoIi/ray-graph-api into account
- Introduce `QueryPipeline` and `IndexingPipeline`
  - Phase out standard pipelines
- Replace the original `Pipeline` and migrate nodes
  - Heavy breaking change
- Cobble together some support for DocumentStores and Retrievers
  - Define the DocumentStores' position in the pipeline by making them either normal nodes or not nodes at all
  - Decouple DocumentStores and Retrievers
- Phase out `convert_file_to_docs` and replace it with indexing pipelines wherever it is used
Each step comes with its own test refactoring.
Notes:
- Splitting the execution of a pipeline from the graph and the nodes themselves (see below and https://deepset-ai.slack.com/archives/C02HA67P97Y/p1659605858455789)
- We can take inspiration from the Rasa approach to Pipelines https://github.com/RasaHQ/rasa/blob/main/rasa/engine/graph.py
Issue Analytics
- Created a year ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Sorry, that was rather a made up example. I thought that you might have different modes (indexing vs. inference) for the pipeline nodes but if it’s a completely separate pipeline we probably don’t need it.
It would be super cool if we could abstract the infrastructure which runs the pipeline. I'd imagine an interface `PipelineRunner` which gets a `Pipeline`. The `PipelineRunner` can then either run the pipeline locally or could, e.g., also distribute its execution. For that, it shouldn't be required to inspect the individual pipeline nodes (pipeline nodes should be abstract components with a `run` and `index` method plus some initializer). Details such as whether something is a join or split node shouldn't be relevant for the `PipelineRunner`. That would be super cool for distributed execution, as this currently means passing an initialized docstore around, which is tough since docstores usually have open network connections that we can't serialize and send around.
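The `PipelineRunner` idea from this comment can be sketched as follows (a hypothetical interface, not an agreed design; all names are illustrative): the pipeline is pure data describing the graph, and a runner decides how to execute it.

```python
# Sketch of the PipelineRunner idea (hypothetical interface): the graph is
# plain data, and the runner alone decides where and how nodes execute.
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Callable

class Pipeline:
    """Just the graph: an ordered list of named nodes, no execution logic."""
    def __init__(self, nodes: List[Tuple[str, Callable[[Dict], Dict]]]):
        self.nodes = nodes

class PipelineRunner(ABC):
    @abstractmethod
    def run(self, pipeline: Pipeline, data: Dict) -> Dict: ...

class LocalRunner(PipelineRunner):
    """Runs every node in-process; a Ray-based runner could instead ship
    the same Pipeline object to remote workers without inspecting nodes."""
    def run(self, pipeline: Pipeline, data: Dict) -> Dict:
        for name, node in pipeline.nodes:
            data = node(data)
        return data

pipe = Pipeline(nodes=[
    ("lower", lambda d: {**d, "text": d["text"].lower()}),
    ("split", lambda d: {**d, "tokens": d["text"].split()}),
])
result = LocalRunner().run(pipe, {"text": "Hello World"})
```

Because the runner only sees the `Pipeline` as data, distributed execution becomes a matter of swapping the runner, which is exactly why keeping live DocumentStore connections out of the nodes matters.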