
software-defined assets with dynamic mapping


A dynamically mapped software-defined asset could look something like this:

@mapped_asset(num_steps=100)  # hypothetical decorator proposed in this issue
def my_mapped_asset(context, upstream_asset):
    mapping_key = context.mapping_key
    # Select this step's slice of the upstream asset (DataFrame-style row filter)
    return some_function(upstream_asset[upstream_asset["some_col"] == mapping_key])
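
For context, Dagster already has a dynamic-output primitive at the op level, which a mapped_asset decorator would presumably build on. A minimal runnable sketch of that existing API (the chunking here is illustrative):

from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out():
    # One dynamic output per chunk; the mapping key identifies each mapped step
    for i in range(3):
        yield DynamicOutput(value=i, mapping_key=f"chunk_{i}")


@op
def process_chunk(chunk: int) -> int:
    return chunk * 2


@job
def mapped_job():
    fan_out().map(process_chunk)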

Considerations:

  • What if some of the steps are successful but others aren’t? Do we yield an AssetMaterialization?
  • Should the IO manager load only the chunk of the upstream asset that we’re processing within the step? How do we accomplish that? (A hedged sketch follows this list.)
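
On the second point, a minimal sketch of what a chunk-aware IO manager could look like. It assumes the proposed mapped-asset API would surface the current step’s mapping key on the input context (no such attribute exists today), and the some_col filter mirrors the pseudocode above:

import pandas as pd
from dagster import IOManager, InputContext, OutputContext, io_manager


class ChunkedIOManager(IOManager):
    """Sketch: load only the chunk of the upstream asset needed by a mapped step."""

    def handle_output(self, context: OutputContext, obj: pd.DataFrame):
        obj.to_parquet(f"/tmp/{context.asset_key.path[-1]}.parquet")

    def load_input(self, context: InputContext) -> pd.DataFrame:
        df = pd.read_parquet(f"/tmp/{context.asset_key.path[-1]}.parquet")
        # Hypothetical: assumes the proposed API exposes the mapped step's
        # mapping key on the input context; Dagster does not do this today.
        mapping_key = getattr(context, "mapping_key", None)
        if mapping_key is not None:
            return df[df["some_col"] == mapping_key]
        return df


@io_manager
def chunked_io_manager():
    return ChunkedIOManager()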



Top GitHub Comments

NicolasPA commented, Nov 8, 2022 (1 reaction)

Since I’d prefer my graph to show source_files -> table, making both data structures visible, I’ve added an asset source_files that is used as the input of the graph-backed asset table.

So visually it goes from a lonely asset in your example: [image]

To a more expressive graph: [image]

Exploded op graph view: [image]

Additionally (a bit out of scope here, but useful): I’m using a sensor that can modify the list of files that source_files outputs, via an optional config parameter and the newly added (🙏) ability to specify a run_config in run_request_for_partition. The use case is automated backfill of partial partitions.

from typing import List

from dagster import (
    graph,
    AssetsDefinition,
    op,
    DynamicOut,
    DynamicOutput,
    repository,
    DailyPartitionsDefinition,
    Field,
    Array,
    asset,
    OpExecutionContext,
    Output,
    AssetSelection,
    define_asset_job,
    sensor,
)

DAILY_PARTITIONS = DailyPartitionsDefinition(start_date="2022-06-01")


@asset(
    description="Files to load",
    partitions_def=DAILY_PARTITIONS,
    config_schema={
        "selected_file_paths": Field(Array(str), is_required=False, default_value=[])
    },
)
def source_files(context: OpExecutionContext):
    selected_file_paths = context.op_config["selected_file_paths"]
    if selected_file_paths:
        context.log.info(f"Found selected file paths: {selected_file_paths}")
        file_paths = selected_file_paths
    else:
        context.log.info("Looking for paths matching the pattern.")
        file_paths = ["a", "b", "c"]
    return Output(file_paths)


@op(out=DynamicOut())
def output_files_dynamically(source_files):
    # Fan out: one dynamic output per file, keyed by file name
    for file_name in source_files:
        yield DynamicOutput(mapping_key=file_name, value=file_name)


@op
def load_to_table(context, file_name):
    """Load the file into the table"""
    context.log.info(f"Loading to table for file {file_name}")
    return file_name


@op(description="Loaded table")
def merge(all_file_names: List[str]):
    """Merge all the files"""


@graph(name="table")
def load_files_to_table_graph(source_files):
    # Fan out over the files, load each one, then fan in to a single merge
    return merge(output_files_dynamically(source_files).map(load_to_table).collect())


table = AssetsDefinition.from_graph(
    load_files_to_table_graph,
    partitions_def=DAILY_PARTITIONS,
    keys_by_input_name={"source_files": source_files.asset_key},
)

load_files_to_table_job = define_asset_job(
    name="load_files_to_table_job",
    selection=AssetSelection.assets(source_files, table),
    partitions_def=DAILY_PARTITIONS,
)


@sensor(job=load_files_to_table_job)
def new_files_sensor():
    # some missing-file detection logic; suppose it finds:
    new_files_partitions = [
        {"partition_date": "2022-11-05", "selected_file_paths": ["d", "e"]},
        {"partition_date": "2022-11-06", "selected_file_paths": ["f", "g"]},
    ]
    for new_files_partition in new_files_partitions:
        run_config = {
            "ops": {
                "source_files": {
                    "config": {
                        "selected_file_paths": new_files_partition[
                            "selected_file_paths"
                        ]
                    }
                }
            }
        }
        yield load_files_to_table_job.run_request_for_partition(
            partition_key=new_files_partition["partition_date"], run_config=run_config
        )


@repository
def repo():
    return [
        source_files,
        table,
        new_files_sensor,
    ]

A little quirk to note about the name and description that show up for the graph-backed asset in Dagit: while the asset’s name uses the name defined in the graph, the description that appears is the one from the last op of the graph, which is surprising; that’s why my merge op carries the “Loaded table” description. I think both name and description should be picked up from the graph by default, with optional arguments in AssetsDefinition.from_graph() to override that default.
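
For illustration, overriding the description at definition time could look like this; the descriptions_by_output_name argument is a sketch of the proposed override, not an API confirmed to be available at the time of this comment:

table = AssetsDefinition.from_graph(
    load_files_to_table_graph,
    partitions_def=DAILY_PARTITIONS,
    keys_by_input_name={"source_files": source_files.asset_key},
    # Hypothetical override: "result" is the graph's default output name
    descriptions_by_output_name={"result": "Loaded table"},
)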

sryza commented, Oct 25, 2022 (1 reaction)

Oops you are right - fixed
