Improve syntax for op order dependencies

Assume we have solid_a and solid_b. If we want solid_b to execute after solid_a, we can use the composition function to express the ordering:

@pipeline
def pipe():
    solid_b(solid_a())

However, if solid_b doesn’t depend on solid_a’s outputs, we need to define solid_b’s input_defs as something like:

input_defs=[InputDefinition(_START, Nothing)]

We could probably come up with a way to express order dependencies in the composition function syntax, or at least create a simple alias for InputDefinition(_START, Nothing) to make it easier to understand.
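To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy model, not Dagster's API: the `steps` dictionary, the `data` and `after` keys, and `execution_order` are all hypothetical names invented for illustration. It only shows that an order-only dependency constrains scheduling the same way a data dependency does, without any value being passed.

```python
from graphlib import TopologicalSorter

# Hypothetical toy model, not Dagster's API: each step lists the steps whose
# outputs it consumes ("data") and the steps it must merely wait for ("after").
steps = {
    "solid_a": {"data": [], "after": []},
    "solid_b": {"data": [], "after": ["solid_a"]},  # order-only dependency
}


def execution_order(steps):
    """Return one valid run order that honors both kinds of dependency."""
    graph = {
        name: set(deps["data"]) | set(deps["after"])
        for name, deps in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
```

In this model the scheduler treats both kinds of edge identically. The difference is only in what the downstream step receives, which is roughly what a Nothing-typed input expresses.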

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 10
  • Comments: 15 (14 by maintainers)

Top GitHub Comments

9 reactions
basilvetas commented, Apr 2, 2020

I’ve run into some implementation challenges related to this issue – spoke with some of your team in Slack and was encouraged to provide my use case as an example here.

So I’m building an ELT pipeline, and as a convention have set up my pipeline to have three stages: extract, load, transform. I’m using Nothing passed between them to create the dependency structure. My pipeline looks like this:

@pipeline(
    mode_defs=mode_defs(),
    preset_defs=preset_defs(),
)
def example_pipeline():
    transform(depends_on=[
        load(depends_on=[
            extract()
        ])
    ])

In general, “extract” means extract data from original source into an S3 staging bucket, “load” means copy data from S3 staging bucket into Redshift within my “raw data” schema, and “transform” means move data from “raw data” schema into “production ready” schema while performing necessary cleaning, filtering, transforms along the way. I’m primarily using dbt models via dagster-dbt in the transform stage.

Each of the three solids (extract, load, transform) is defined as a composite_solid, allowing me to break these stages down into child solids within each. (I’ve established these conventions for consistency because I’ll eventually be building more pipelines for different data sources, other developers will be building pipelines, etc.)

I didn’t want to have to map the Nothing input into my child solids because many of my child solids are reusable utils, and I didn’t want to have to change the solid input definition just to map Nothing for this specific use case. So, as a workaround, I wrote this solid to receive the Nothing input within each composite_solid:

@solid(input_defs=[InputDefinition(name='depends_on', dagster_type=Nothing)])
def do_nothing(context) -> Nothing:
    return

My three stages basically look like this:

@composite_solid(
    input_defs=[InputDefinition(name='depends_on', dagster_type=Nothing)]
)
def extract(depends_on: Nothing) -> Nothing:

    # child solids extract stuff...

    return do_nothing(depends_on=depends_on)


@composite_solid(
    input_defs=[InputDefinition(name='depends_on', dagster_type=Nothing)]
)
def load(depends_on: Nothing) -> Nothing:

    # child solids load stuff...

    return do_nothing(depends_on=depends_on)


@composite_solid(
    input_defs=[InputDefinition(name='depends_on', dagster_type=Nothing)]
)
def transform(depends_on: Nothing) -> Nothing:

    # child solids transform stuff...

    return do_nothing(depends_on=depends_on)

The challenge I’m still running into is that, because I haven’t mapped the Nothing input to my child solids, the dependency structure is enforced only for the do_nothing solids and not for the child solids that actually do the work (i.e. my dbt model in transform starts executing before load has finished copying data). It sounds like I could go back and map Nothing into my custom util child solids to enforce the dependency structure. In the case of my dbt models, however, I’m not sure what to do: when using create_dbt_run_solid from dagster-dbt, I can’t change the solid definition in order to map the Nothing input.
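The behavior described above can be reproduced with a toy graph in plain Python. This is a sketch under stated assumptions, not Dagster code: the node names such as `extract.child` and the helper `can_start_immediately` are hypothetical, and the "mapped" wiring is just one assumed way the edges could look if the Nothing input reached the child solids.

```python
from graphlib import TopologicalSorter

# Hypothetical node names for illustration; this is a toy graph, not Dagster.
# Each key maps to the set of nodes that must finish before it may start.
unmapped = {
    "extract.child": set(),
    "extract.do_nothing": set(),
    "load.child": set(),  # no upstream edge at all
    "load.do_nothing": {"extract.do_nothing"},
}


def can_start_immediately(graph):
    """Nodes with no predecessors, i.e. free to run as soon as the run begins."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    return set(sorter.get_ready())


ready_unmapped = can_start_immediately(unmapped)

# Assumed fix: do_nothing waits on the stage's own children, and the next
# stage's children wait on the previous stage's do_nothing.
mapped = {
    "extract.child": set(),
    "extract.do_nothing": {"extract.child"},
    "load.child": {"extract.do_nothing"},
    "load.do_nothing": {"load.child"},
}
ready_mapped = can_start_immediately(mapped)
```

In the unmapped graph, `load.child` has no predecessors, so nothing stops it from starting alongside `extract.child`. Once every child sits on the Nothing chain, only `extract.child` is ready at the start of the run.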

I’m still learning Dagster, so please let me know if my understanding of things is incorrect or incomplete. There may be simple solutions that I’ve missed, and I may be abusing the system a bit. If that is the case, please let me know how you would suggest resolving these issues.

In general though, as a user, I would ideally like a way to establish my dependency structure between my composite solids, and be guaranteed that the structure will be enforced across the child solids without having to do this extra layer of Nothing mapping to child solids. Additionally, my do_nothing solid feels hacky and long term I’d love a way to do away with that.

I can’t speak to the implementation challenges in your system, but as a user I think something more semantic than passing Nothing for establishing dependency structures would definitely be welcome. In my specific use case, it seems like the notion of my “extract” stage being “done” or my “load” stage being “done” really corresponds to “some asset exists in S3” or “some table has been populated in Redshift” etc and these are the real notions that should be used to establish dependencies/preconditions. Just thinking out loud now. Hope my example is helpful in brainstorming solutions and please let me know if there is anything I can do to help. Have really enjoyed using Dagster thus far!

1 reaction
mgasner commented, Oct 22, 2019
(a, b) = solid_a()
solid_a_done = solid_a.done()
solid_b(a)
solid_c(a, b)
solid_d(solid_a_done)