wrong step_context in IO manager's InputContext.upstream_output

Summary

The IO manager’s load_input receives the wrong context.upstream_output.step_context. In the example below, when loading input val_a for op_b, I would expect context.upstream_output.step_context to be the context of op_a, which generated val_a. Instead, we get the context of op_b.

Reproduction

The following code reproduces the issue:

import dagster

class CustomIOManager(dagster.IOManager):

    def handle_output(self, context, obj):
        pass

    def load_input(self, context):
        print(f"[load_input] input_name={context.name}, "
              f"op_config={context.upstream_output.step_context.op_config}")
        return 1

@dagster.io_manager
def custom_io_manager(_):
    return CustomIOManager()

@dagster.op
def op_a():
    return 1

@dagster.op
def op_b(val_a: int) -> int:
    return 2

@dagster.job(
    resource_defs={"io_manager": custom_io_manager},
)
def the_job():
    op_b(op_a())

run_config = {
    "ops": {
        "op_a": {"config": {"config_key": "A"}},
        "op_b": {"config": {"config_key": "B"}},
    }
}

the_job.execute_in_process(run_config=run_config)

I would expect it to print

[load_input] input_name=val_a, op_config={'config_key': 'A'}

yet it prints

[load_input] input_name=val_a, op_config={'config_key': 'B'}

Further discussion

This example surfaces the problem through run_config, which is delivered to the IO manager through step_context. Here are some consequences of this problem:

For assets, both mem_io_manager and fs_io_manager derive asset paths from context.upstream_output in load_input. Obtaining partition_key relies on the op config, which is wrong in the upstream_output context. In the usual case, where partition_defs and output names are the same in the upstream and downstream assets, the relevant step_context is effectively the same and the problem is hidden. However, it shows up in these circumstances:

  1. A chain of assets with different partition_defs and a partition_mapping. Since partition_key differs between upstream and downstream, the wrong step_context leads to the wrong partition_key.
  2. An attempt to extend @multi_asset to support partitions. In this case, the output names may differ between upstream and downstream assets, which again surfaces the problem.
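The first case can be illustrated with a minimal sketch. It uses stand-in dataclasses rather than Dagster's real context classes, and the partition keys, the one-day-lag mapping, and the path_for_input helper are all assumptions made up for illustration. It only shows how a path derived from upstream_output.step_context ends up pointing at the downstream partition.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in classes for illustration only; these are NOT Dagster's real classes.
@dataclass
class StepContext:
    step_key: str
    partition_key: str


@dataclass
class OutputContext:
    asset_name: str
    step_context: Optional[StepContext]


@dataclass
class InputContext:
    name: str
    upstream_output: OutputContext


def path_for_input(context: InputContext) -> str:
    """Derive a storage path from upstream_output, as an asset IO manager might."""
    out = context.upstream_output
    return f"{out.asset_name}/{out.step_context.partition_key}"


# Hypothetical partition mapping: each downstream day reads the previous upstream day.
upstream_step = StepContext(step_key="asset_a", partition_key="2022-05-17")
downstream_step = StepContext(step_key="asset_b", partition_key="2022-05-18")

# Reported behaviour: upstream_output carries the downstream step's context.
reported = InputContext("asset_a", OutputContext("asset_a", downstream_step))
# Expected behaviour: upstream_output carries the upstream step's context.
expected = InputContext("asset_a", OutputContext("asset_a", upstream_step))

reported_path = path_for_input(reported)
expected_path = path_for_input(expected)
print(reported_path)
print(expected_path)
```

When both assets share the same partition key the two paths coincide, which is why the problem stays hidden in the usual case.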

Possible fixes

I am happy to submit a PR for this. Potential fixes that I could come up with:

  1. Construct an InputContext where upstream_output.step_context = None. Then adjust IO managers for assets to get partition_key from the input context, instead of context.upstream_output.
  2. Somehow, get the correct step_context for the upstream op.

Option 1 seems simpler, yet it can break people’s custom IO managers. Option 2 seems messy to implement: when the InputContext is constructed here, we seem to have no access to the step contexts of the ops that generate the inputs. I am not sure the extra complexity is justified.
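A rough sketch of what Option 1 could look like, again with stand-in dataclasses instead of Dagster's real classes. The partition_key field on the input context and both path helpers are hypothetical names chosen for illustration. The sketch also shows the compatibility risk: a loader that still reaches into upstream_output.step_context fails once that attribute is None.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in classes for illustration only; these are NOT Dagster's real classes.
@dataclass
class OutputContext:
    asset_name: str
    step_context: Optional[object] = None  # Option 1: left unset when loading an input


@dataclass
class InputContext:
    name: str
    partition_key: str  # hypothetical field: the upstream partition to load
    upstream_output: OutputContext


def path_for_input(context: InputContext) -> str:
    """Adjusted loader: take the partition key from the input context itself."""
    return f"{context.upstream_output.asset_name}/{context.partition_key}"


def legacy_path_for_input(context: InputContext) -> str:
    """Unadjusted loader: still reads the partition key via upstream_output."""
    out = context.upstream_output
    return f"{out.asset_name}/{out.step_context.partition_key}"


ctx = InputContext("asset_a", "2022-05-17", OutputContext("asset_a"))

new_path = path_for_input(ctx)
print(new_path)

# A custom IO manager written against the old behaviour breaks under Option 1.
try:
    legacy_path_for_input(ctx)
    legacy_broke = False
except AttributeError:
    legacy_broke = True
print(legacy_broke)
```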

Some feedback on how to proceed would be appreciated.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
sryza commented, Jun 6, 2022

@aroig I’m going to close this because I believe you addressed the problem in other pull requests. Feel free to reopen if I’m misunderstanding.

0 reactions
aroig commented, May 18, 2022

@sryza #7958 is an attempt to address this. It still needs some more work though.
