Refactor pipeline's design
Current Status and Issues
- We distinguish between query pipelines and indexing pipelines
  - Query pipelines:
    - Use `"Query"` as root node
    - Output: `List[Document]` or `List[Answer]`
    - Access the `DocumentStore` through a `Retriever`
  - Indexing pipelines:
    - Use `"File"` as root node
    - Don't have an output, but index `Document`s to a `DocumentStore`
    - Currently not very prominent:
      - not really used in Tutorials
      - the `convert_file_to_docs` utility function currently takes the role of indexing pipelines
    - Access the `DocumentStore` explicitly
- You can theoretically have more than one `DocumentStore` per Pipeline, but right now this possibility is not contemplated
- `Pipeline`'s `run` method takes basically anything: `query`, `documents`, `file_paths`, `labels`, …
- Next to the generic `Pipeline`, we have a collection of "standard pipelines"
  - Currently, these are only query pipelines
  - Purpose: easing things for people who aren't aware of how Pipelines should look for specific tasks, like a "recipe"
  - Syntactic sugar: they don't add anything over a plain Pipeline made of the same nodes.
- Due to the `"Query"` and `"File"` root nodes, Pipelines can only be query or indexing pipelines: other approaches are really hard to implement
  - For example, it is quite complicated to build a summarization pipeline
Core ideas
- Clarify the status of `DocumentStore` in Pipelines
  - Decision point: are DocumentStores nodes or not?
    - If they are, they need to behave like normal nodes
    - If they are not, we need to clarify their API with respect to Pipelines
  - Direct access to DocumentStores by nodes should be made harder to avoid the DocumentStore-Retriever coupling issue. Pipelines should mediate it.
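One possible shape for that mediation, sketched as a hypothetical design (all class and parameter names here are illustrative, not an agreed API): the pipeline owns the DocumentStore and injects it at run time, so nodes never hold a direct reference.

```python
# Hypothetical sketch: the Pipeline, not the node, owns the DocumentStore.
# Nodes receive the store only as a run-time argument.

class InMemoryStore:
    def __init__(self):
        self.docs = []
    def write(self, docs):
        self.docs.extend(docs)
    def filter(self, keyword):
        return [d for d in self.docs if keyword in d]

class KeywordRetriever:
    # Note: no __init__(document_store=...) — the node holds no store reference.
    def run(self, query, document_store):
        return {"documents": document_store.filter(query)}

class MediatedPipeline:
    def __init__(self, document_store, nodes):
        self.document_store = document_store
        self.nodes = nodes

    def run(self, query):
        data = {"query": query}
        for node in self.nodes:
            # The pipeline injects the store, mediating all access to it.
            data.update(node.run(query=data["query"],
                                 document_store=self.document_store))
        return data

store = InMemoryStore()
store.write(["haystack pipelines", "needle in haystack", "other text"])
pipe = MediatedPipeline(store, [KeywordRetriever()])
result = pipe.run("haystack")
```

With this shape, swapping the DocumentStore is a pipeline-level change and nodes stay serializable, which matters for the distributed-execution ideas discussed further down.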
- Pipelines should always produce an output
  - Currently indexing pipelines do not produce an output, while query pipelines do
  - In the case of indexing pipelines, the output could be the list of `Document` objects that were indexed (to be clarified)
- Get rid of the `convert_file_to_docs` utility function
  - It is used in a lot of tutorials and should be replaced by indexing pipelines
- Get rid of the `"File"` and `"Query"` root nodes
  - We should stop classifying pipelines as indexing or query and allow a more flexible structure
  - This would result in a generic `Pipeline` that works easily with any use case
  - Additionally, we might have the subclasses `QueryPipeline(Pipeline)` and `IndexingPipeline(Pipeline)`, which would help users understand what's required for querying and for indexing respectively
- Nodes should declare their input
  - Right now nodes use the `run` method's signature to declare their input. Mypy doesn't like that, for a good reason
  - Nodes should be free to take the input they need, either with `**kwargs` or with an object/dictionary
  - They need to declare a list of expected parameters to allow validation
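One way such a declaration could look, as a minimal sketch (the attribute name `expected_inputs` and the `BaseComponent` API here are hypothetical, not the proposal itself): nodes list their required parameters in a class attribute, and validation happens before the node-specific logic runs.

```python
# Sketch of input declaration via class attributes (hypothetical API):
# each node lists its expected parameters so a pipeline can validate
# connections up front instead of inspecting run()'s signature.

class BaseComponent:
    expected_inputs: tuple = ()

    def run(self, **kwargs):
        missing = [p for p in self.expected_inputs if p not in kwargs]
        if missing:
            raise ValueError(f"{type(self).__name__} is missing inputs: {missing}")
        return self._run(**kwargs)

    def _run(self, **kwargs):
        raise NotImplementedError

class Reader(BaseComponent):
    expected_inputs = ("query", "documents")

    def _run(self, query, documents, **kwargs):
        return {"answers": [f"{query} -> {documents[0]}"]}

reader = Reader()
result = reader.run(query="q", documents=["doc1"])
```

Because `expected_inputs` is plain class-level data, a pipeline (or Mypy-friendly tooling) can check node wiring without ever calling `run`.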
- The distinction between `run` and `run_batch` is unnecessary
  - There should be only `run_batch`, as the current `run` is simply a corner case where `batch_size=1`
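The "corner case" framing can be sketched in a few lines (illustrative only, not the actual Haystack API): the batched method is the one real implementation, and a single-item `run` is just sugar over it.

```python
# Sketch: a single batched entry point, with run() kept only as sugar
# for a batch of size one (illustrative, not the actual Haystack API).
from typing import List

class Upcaser:
    def run_batch(self, queries: List[str]) -> List[str]:
        # The one real implementation: always operates on a batch.
        return [q.upper() for q in queries]

    def run(self, query: str) -> str:
        # A single item is just a batch of one.
        return self.run_batch([query])[0]

node = Upcaser()
```

Node authors then implement exactly one method, and the single-item path can never drift out of sync with the batched one.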
- Marginally related:
  - https://www.notion.so/deepsetai/Ray-graph-api-49b8d35a4cd544caaa7ae396790b2926
  - https://github.com/deepset-ai/haystack/issues/2403 (separate the concepts of "Retriever" and "Embedder")
Action Plan
- Create new `Pipeline` and `BaseComponent` objects with basic features
  - Aims:
    - Remove the `Query` and `Indexing` root nodes
    - Refactor the `run` method to always take generic, batched input
    - Implement input validation through class attributes
    - Tackle dynamic outgoing edges management (#2850)
    - Add support for circular pipeline graphs?
    - Add support for `async`? See https://github.com/deepset-ai/haystack/issues/2968
    - Add support for nodes to output to several nodes at once: see https://github.com/deepset-ai/haystack/discussions/2972
- Migrate split/join nodes
- Ensure loading from YAML and validation before loading are still possible
  - Naming the content of YAML files is a bonus (a pipeline "schema"/"group"/"blueprint"? A "configuration"? An "architecture"? A "manifest"?)
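For reference, a minimal sketch of the thing being named, loosely following the current YAML layout (component names and exact keys here are illustrative; the real schema is defined by Haystack's pipeline validation):

```yaml
# Illustrative pipeline "manifest" (naming to be decided): components are
# declared once, then wired into one or more pipelines via their inputs.
version: "1.0"
components:
  - name: MyDocumentStore
    type: InMemoryDocumentStore   # illustrative component type
  - name: MyRetriever
    type: BM25Retriever           # illustrative component type
    params:
      document_store: MyDocumentStore
pipelines:
  - name: query
    nodes:
      - name: MyRetriever
        inputs: [Query]           # the "Query" root node this refactor removes
```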
- Implement a new `RayPipeline`, taking https://www.notion.so/deepsetai/Ray-graph-api-49b8d35a4cd544caaa7ae396790b2926 and https://github.com/ArzelaAscoIi/ray-graph-api into account
- Introduce `QueryPipeline` and `IndexingPipeline`
  - Phase out standard pipelines
- Replace the original `Pipeline` and migrate nodes
  - Heavy breaking change
- Cobble together some support for DocumentStores and Retrievers
  - Define the DocumentStores' position in the pipeline by making them either normal nodes or not nodes at all
  - Decouple DocumentStores and Retrievers
- Phase out `convert_file_to_docs` and replace it with indexing pipelines wherever it is used
Each step comes with its own test refactoring.
Notes:
- Splitting the execution of a pipeline from the graph and the nodes themselves (see below and https://deepset-ai.slack.com/archives/C02HA67P97Y/p1659605858455789)
- We can take inspiration from the Rasa approach to Pipelines https://github.com/RasaHQ/rasa/blob/main/rasa/engine/graph.py
Issue Analytics
- Created a year ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
Sorry, that was rather a made up example. I thought that you might have different modes (indexing vs. inference) for the pipeline nodes but if it’s a completely separate pipeline we probably don’t need it.
It would be super cool if we could abstract the infrastructure which runs the pipeline. I'd imagine an interface `PipelineRunner` which gets a `Pipeline`. The `PipelineRunner` can then either run the pipeline locally or could, e.g., also distribute its execution. For that, it shouldn't be required to inspect the individual pipeline nodes (pipeline nodes should be abstract components with a `run` and `index` method plus some initializer). Details such as whether something is a join or split node shouldn't be relevant for the `PipelineRunner`. That would be super cool for distributed execution, as this currently means passing an initialized docstore around, which is tough since docstores usually have open network connections that we can't serialize and send around.
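The `PipelineRunner` idea from this comment can be sketched as follows (a hypothetical interface, not an agreed design; all names are illustrative): the pipeline is pure data describing the graph, and a runner decides how to execute it.

```python
# Sketch of the PipelineRunner idea (hypothetical interface): the graph is
# plain data, and the runner alone decides where and how nodes execute.
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Callable

class Pipeline:
    """Just the graph: an ordered list of named nodes, no execution logic."""
    def __init__(self, nodes: List[Tuple[str, Callable[[Dict], Dict]]]):
        self.nodes = nodes

class PipelineRunner(ABC):
    @abstractmethod
    def run(self, pipeline: Pipeline, data: Dict) -> Dict: ...

class LocalRunner(PipelineRunner):
    """Runs every node in-process; a Ray-based runner could instead ship
    the same Pipeline object to remote workers without inspecting nodes."""
    def run(self, pipeline: Pipeline, data: Dict) -> Dict:
        for name, node in pipeline.nodes:
            data = node(data)
        return data

pipe = Pipeline(nodes=[
    ("lower", lambda d: {**d, "text": d["text"].lower()}),
    ("split", lambda d: {**d, "tokens": d["text"].split()}),
])
result = LocalRunner().run(pipe, {"text": "Hello World"})
```

Because the runner only sees the `Pipeline` as data, distributed execution becomes a matter of swapping the runner, which is exactly why keeping live DocumentStore connections out of the nodes matters.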