
Refactor pipeline's design

See original GitHub issue

From https://www.notion.so/deepsetai/Re-base-4-Pipeline-Type-Fundamentals-88d305b5edec4bdb923ba7c3d63a8d5c

Current Status and Issues

  • We distinguish between query pipelines and indexing pipelines
    • Query pipelines:
      • Use "Query" as root node
      • Output: List[Document] or List[Answer]
      • Access of DocumentStore through Retriever
    • Indexing pipelines:
      • Use "File" as root node
      • Don’t have an output, but index Documents to DocumentStore
      • Currently not very prominent
        • not really used in tutorials
        • the convert_file_to_docs utility function currently plays the role of indexing pipelines
      • Explicit access of DocumentStore
  • You can theoretically have more than one DocumentStore per Pipeline, but this possibility is currently not accounted for in the design
  • Pipeline’s run method takes basically anything: query, documents, file_paths, labels, …
  • Next to the generic Pipeline, we have a collection of “standard pipelines”
    • Currently, these are only query pipelines
    • Purpose: easing use for people who don’t know what a Pipeline should look like for a specific task, like a “recipe”
    • Syntactic sugar: they don’t add anything to a simple Pipeline made of the same nodes.
  • Due to the "Query" and "File" root nodes, Pipelines can only be query or indexing: other approaches are really hard to implement
    • For example, quite complicated to build a summarization pipeline
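
The coupling described above can be illustrated with a toy model (this is not actual Haystack code; `ToyPipeline` and its behavior are invented for illustration). Because the root node name admits only two values, every pipeline is forced to be either a query or an indexing pipeline, which is exactly why a summarization pipeline is hard to express:

```python
from typing import Any, Callable, Dict, List

# Toy model of the current design: the root node name hard-codes
# whether a pipeline is a query pipeline or an indexing pipeline.
class ToyPipeline:
    def __init__(self, root: str):
        if root not in ("Query", "File"):
            raise ValueError("only query and indexing pipelines exist")
        self.root = root
        self.nodes: List[Callable[[Dict[str, Any]], Dict[str, Any]]] = []

    def add_node(self, node: Callable[[Dict[str, Any]], Dict[str, Any]]) -> None:
        self.nodes.append(node)

    def run(self, **kwargs: Any) -> Dict[str, Any]:
        # `run` accepts basically anything: query, documents, file_paths, ...
        data: Dict[str, Any] = kwargs
        for node in self.nodes:
            data = node(data)
        return data

query_pipe = ToyPipeline(root="Query")
query_pipe.add_node(lambda d: {"documents": [f"doc for {d['query']}"]})
result = query_pipe.run(query="what is a pipeline?")
```

A hypothetical `ToyPipeline(root="Summarize")` raises immediately: there is no third kind of pipeline in this design, which is the limitation the proposal wants to remove.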

Core ideas

  • Clarify the status of DocumentStore in Pipelines

    • Decision point: are DocumentStores nodes or not?
    • If they are, they need to behave like normal nodes
    • If they are not, we need to clarify their API with respect to Pipelines.
    • Direct access of DocumentStores by nodes should be made harder to avoid the DocStore-Retriever issue. Pipelines should mediate it.
  • Pipelines should always produce an output

    • Currently indexing pipelines do not produce an output, while query pipelines do.
    • For indexing pipelines, the output could be the list of Document objects to be indexed (to be clarified)
  • Get rid of convert_file_to_docs utility function

    • Used in a lot of tutorials; it should be replaced by indexing pipelines
  • Get rid of "File" and "Query" root node

    • We should stop classifying pipelines as indexing or query, and instead allow a more flexible structure.
    • This would then be a generic Pipeline which should work easily with any use case
    • Additionally, we might have the subclasses QueryPipeline(Pipeline) and IndexingPipeline(Pipeline) which would help users understand what’s required for querying and for indexing respectively.
  • Nodes should declare their input

    • Right now nodes use the run method to declare their input. Mypy doesn’t like that, for a good reason
    • Nodes should be free to take the input they need, either with **kwargs or with an object/dictionary.
    • They need to declare a list of expected parameters to allow validation.
  • Distinction between run and run_batch is unnecessary

    • There should be only run_batch, as the current run is simply a corner case where batch_size=1
  • Marginally related:
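
Two of the ideas above, declared inputs and a single batched run method, can be sketched together. All names here (`Component`, `expected_inputs`, `validate`) are hypothetical, not existing Haystack API:

```python
from typing import Any, Dict, List

class Component:
    # Each node declares the parameters it expects, so the pipeline can
    # validate a graph up front instead of relying on `run`'s signature
    # (which mypy cannot check through **kwargs).
    expected_inputs: List[str] = []

    def run_batch(self, inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        raise NotImplementedError

    # `run` is just the batch_size=1 corner case of `run_batch`.
    def run(self, **kwargs: Any) -> Dict[str, Any]:
        return self.run_batch([kwargs])[0]

class Upper(Component):
    expected_inputs = ["text"]

    def run_batch(self, inputs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return [{"text": item["text"].upper()} for item in inputs]

def validate(component: Component, params: Dict[str, Any]) -> None:
    missing = [p for p in component.expected_inputs if p not in params]
    if missing:
        raise ValueError(f"missing inputs: {missing}")

node = Upper()
validate(node, {"text": "hello"})                       # passes
single = node.run(text="hello")
batch = node.run_batch([{"text": "a"}, {"text": "b"}])
```

With this shape, `run` and `run_batch` cannot drift apart, and validation needs no signature introspection.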

Action Plan

  • Create a new Pipeline and BaseComponent objects with basic features
  • Migrate split/join nodes
  • Ensure loading from YAML and validation before loading are still possible
    • Naming the content of YAML files is a bonus (A pipeline “schema”/“group”/“blueprint”? A “configuration”? An “architecture”? A “manifest”?)
  • Implement a new RayPipeline taking https://www.notion.so/deepsetai/Ray-graph-api-49b8d35a4cd544caaa7ae396790b2926 and https://github.com/ArzelaAscoIi/ray-graph-api into account
  • Introduce QueryPipeline and IndexingPipeline
  • Phase out standard pipelines
  • Replace the original Pipeline and migrate nodes
    • Heavy breaking change
    • Cobble together some support for Docstores and Retrievers
  • Define Docstores’ position in the pipeline by making them either normal nodes, or not nodes at all.
  • Decouple Docstores and Retrievers
  • Phase out convert_file_to_docs and replace with indexing pipelines wherever used

Each step comes with its own test refactoring.
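
The “validation before loading” step might look roughly like this. The `definition` shape is a guess at what a pipeline YAML could contain, shown as a plain dict so the example needs no YAML parser:

```python
from typing import Any, Dict

# Hypothetical shape of a pipeline definition as it might appear in YAML.
definition: Dict[str, Any] = {
    "components": [
        {"name": "Converter", "type": "TextConverter"},
        {"name": "Splitter", "type": "PreProcessor"},
    ],
    "pipelines": [
        {"name": "indexing", "nodes": [
            {"name": "Converter", "inputs": []},
            {"name": "Splitter", "inputs": ["Converter"]},
        ]},
    ],
}

def validate_definition(d: Dict[str, Any]) -> None:
    # Validation before loading: every node must reference a declared
    # component, and every input must be another node in the same graph.
    declared = {c["name"] for c in d["components"]}
    for pipe in d["pipelines"]:
        node_names = {n["name"] for n in pipe["nodes"]}
        for node in pipe["nodes"]:
            if node["name"] not in declared:
                raise ValueError(f"unknown component: {node['name']}")
            for inp in node["inputs"]:
                if inp not in node_names:
                    raise ValueError(f"unknown input: {inp}")

validate_definition(definition)  # passes for the definition above
```

Failing fast here, before any component is instantiated, is what makes validation independent of loading.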

Notes:

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
wochinge commented, Aug 5, 2022

Sorry for the nitpick, but what is the index method? Most nodes only have a run method (and run_batch, but that’s a different story)

Sorry, that was rather a made up example. I thought that you might have different modes (indexing vs. inference) for the pipeline nodes but if it’s a completely separate pipeline we probably don’t need it.

1 reaction
wochinge commented, Aug 4, 2022

It would be super cool if we could abstract the infrastructure which runs the pipeline. I’d imagine an interface PipelineRunner which gets a Pipeline.

The PipelineRunner can then either run the pipeline locally or could e.g. also distribute its execution. For that it shouldn’t be required to inspect the individual pipeline nodes (pipeline nodes should be abstract components with a run and index method + some initializer). Details such as whether something is a join or split node shouldn’t be relevant for the PipelineRunner.
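
That interface could be sketched as follows. Everything here is hypothetical, following the comment rather than any existing API: the runner receives a whole pipeline and decides *where* to execute it, without inspecting individual nodes:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

class Pipeline:
    # Join/split details would live inside Pipeline, invisible to runners.
    def __init__(self, nodes: List[Callable[[Dict[str, Any]], Dict[str, Any]]]):
        self.nodes = nodes

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        for node in self.nodes:
            data = node(data)
        return data

class PipelineRunner(ABC):
    # The runner only sees the pipeline as a whole, never its nodes.
    @abstractmethod
    def run(self, pipeline: Pipeline, data: Dict[str, Any]) -> Dict[str, Any]: ...

class LocalRunner(PipelineRunner):
    def run(self, pipeline: Pipeline, data: Dict[str, Any]) -> Dict[str, Any]:
        return pipeline.run(data)

# A distributed runner (e.g. on Ray) would implement the same interface
# and ship the pipeline elsewhere instead of calling it in-process.
pipe = Pipeline([lambda d: {"n": d["n"] + 1}, lambda d: {"n": d["n"] * 2}])
out = LocalRunner().run(pipe, {"n": 3})
```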

Decouple Docstores and Retrievers

That would be super cool for distributed execution as this currently means passing an initialized docstore around which is tough since they usually have open network connections which we can’t serialize and send around.
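
One possible shape of that decoupling, sketched with invented names: the retriever keeps only a serializable descriptor of the docstore and opens its connection lazily, so the node itself can be shipped to remote workers without dragging a live network connection along:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Plain data: safe to pickle and send to a remote worker.
@dataclass(frozen=True)
class DocStoreDescriptor:
    host: str
    index: str

class Retriever:
    def __init__(self, store: DocStoreDescriptor):
        self.store = store
        # The live connection is created on first use, per worker,
        # instead of being serialized along with the node.
        self._connection: Optional[Dict[str, List[str]]] = None

    def _connect(self) -> Dict[str, List[str]]:
        if self._connection is None:
            # Stand-in for opening a real connection to `self.store.host`.
            self._connection = {self.store.index: ["doc-1", "doc-2"]}
        return self._connection

    def retrieve(self, query: str) -> List[str]:
        return self._connect()[self.store.index]

retriever = Retriever(DocStoreDescriptor(host="localhost:9200", index="docs"))
docs = retriever.retrieve("anything")
```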
