
software-defined assets with dynamic mapping


A dynamically mapped software-defined asset could look something like this:

@mapped_asset(num_steps=100)  # hypothetical decorator proposed in this issue
def my_mapped_asset(context, upstream_asset):
    mapping_key = context.mapping_key
    # Select this step's slice of the upstream asset (DataFrame-style row filter)
    return some_function(upstream_asset[upstream_asset["some_col"] == mapping_key])
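
For context, Dagster already has a dynamic-output primitive at the op level, which a mapped_asset decorator would presumably build on. A minimal runnable sketch of that existing API (the chunking here is illustrative):

from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out():
    # One dynamic output per chunk; the mapping key identifies each mapped step
    for i in range(3):
        yield DynamicOutput(value=i, mapping_key=f"chunk_{i}")


@op
def process_chunk(chunk: int) -> int:
    return chunk * 2


@job
def mapped_job():
    fan_out().map(process_chunk)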

Considerations:

  • What if some of the steps are successful but others aren’t? Do we yield an AssetMaterialization?
  • Should the IO manager load only the chunk of the upstream asset that we’re processing within the step? How do we accomplish that? (A hedged sketch follows this list.)
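
On the second point, a minimal sketch of what a chunk-aware IO manager could look like. It assumes the proposed mapped-asset API would surface the current step’s mapping key on the input context (no such attribute exists today), and the some_col filter mirrors the pseudocode above:

import pandas as pd
from dagster import IOManager, InputContext, OutputContext, io_manager


class ChunkedIOManager(IOManager):
    """Sketch: load only the chunk of the upstream asset needed by a mapped step."""

    def handle_output(self, context: OutputContext, obj: pd.DataFrame):
        obj.to_parquet(f"/tmp/{context.asset_key.path[-1]}.parquet")

    def load_input(self, context: InputContext) -> pd.DataFrame:
        df = pd.read_parquet(f"/tmp/{context.asset_key.path[-1]}.parquet")
        # Hypothetical: assumes the proposed API exposes the mapped step's
        # mapping key on the input context; Dagster does not do this today.
        mapping_key = getattr(context, "mapping_key", None)
        if mapping_key is not None:
            return df[df["some_col"] == mapping_key]
        return df


@io_manager
def chunked_io_manager():
    return ChunkedIOManager()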



Top GitHub Comments

NicolasPA commented, Nov 8, 2022 (1 reaction)

Since I’d prefer my graph to show source_files -> table, making both data structures visible, I’ve added an asset source_files that is used as the input of the graph-backed asset table.

So visually it goes from a lonely asset in your example: [image]

To a more expressive graph: [image]

Exploded op graph view: [image]

Additionally (a bit out of scope here, but useful): I’m using a sensor that can modify the list of files that source_files outputs, via an optional config parameter and the newly added (🙏) ability to specify a run_config in run_request_for_partition. The use case is automated backfill of partial partitions.

from typing import List

from dagster import (
    graph,
    AssetsDefinition,
    op,
    DynamicOut,
    DynamicOutput,
    repository,
    DailyPartitionsDefinition,
    Field,
    Array,
    asset,
    OpExecutionContext,
    Output,
    AssetSelection,
    define_asset_job,
    sensor,
)

DAILY_PARTITIONS = DailyPartitionsDefinition(start_date="2022-06-01")


@asset(
    description="Files to load",
    partitions_def=DAILY_PARTITIONS,
    config_schema={
        "selected_file_paths": Field(Array(str), is_required=False, default_value=[])
    },
)
def source_files(context: OpExecutionContext):
    selected_file_paths = context.op_config["selected_file_paths"]
    if selected_file_paths:
        context.log.info(f"Found selected file paths: {selected_file_paths}")
        file_paths = selected_file_paths
    else:
        context.log.info("Looking for paths matching the pattern.")
        file_paths = ["a", "b", "c"]
    return Output(file_paths)


@op(out=DynamicOut())
def output_files_dynamically(source_files):
    # Fan out: one dynamic output per file, keyed by file name
    for file_name in source_files:
        yield DynamicOutput(mapping_key=file_name, value=file_name)


@op
def load_to_table(context, file_name):
    """Load the file into the table"""
    context.log.info(f"Loading to table for file {file_name}")
    return file_name


@op(description="Loaded table")
def merge(all_file_names: List[str]):
    """Merge all the files"""


@graph(name="table")
def load_files_to_table_graph(source_files):
    # Fan out over the files, load each one, then fan in to a single merge
    return merge(output_files_dynamically(source_files).map(load_to_table).collect())


table = AssetsDefinition.from_graph(
    load_files_to_table_graph,
    partitions_def=DAILY_PARTITIONS,
    keys_by_input_name={"source_files": source_files.asset_key},
)

load_files_to_table_job = define_asset_job(
    name="load_files_to_table_job",
    selection=AssetSelection.assets(source_files, table),
    partitions_def=DAILY_PARTITIONS,
)


@sensor(job=load_files_to_table_job)
def new_files_sensor():
    # some missing-file detection logic; suppose it finds:
    new_files_partitions = [
        {"partition_date": "2022-11-05", "selected_file_paths": ["d", "e"]},
        {"partition_date": "2022-11-06", "selected_file_paths": ["f", "g"]},
    ]
    for new_files_partition in new_files_partitions:
        run_config = {
            "ops": {
                "source_files": {
                    "config": {
                        "selected_file_paths": new_files_partition[
                            "selected_file_paths"
                        ]
                    }
                }
            }
        }
        yield load_files_to_table_job.run_request_for_partition(
            partition_key=new_files_partition["partition_date"], run_config=run_config
        )


@repository
def repo():
    return [
        source_files,
        table,
        new_files_sensor,
    ]

A little quirk to note about the name and description that show up for the graph-backed asset in Dagit: while the asset’s name uses the name defined in the graph, the description that appears is the one from the last op of the graph, which is surprising; that’s why my merge op carries the “Loaded table” description. I think both name and description should be picked up from the graph by default, with optional arguments in AssetsDefinition.from_graph() to override that default.
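
For illustration, overriding the description at definition time could look like this; the descriptions_by_output_name argument is a sketch of the proposed override, not an API confirmed to be available at the time of this comment:

table = AssetsDefinition.from_graph(
    load_files_to_table_graph,
    partitions_def=DAILY_PARTITIONS,
    keys_by_input_name={"source_files": source_files.asset_key},
    # Hypothetical override: "result" is the graph's default output name
    descriptions_by_output_name={"result": "Loaded table"},
)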

sryza commented, Oct 25, 2022 (1 reaction)

Oops you are right - fixed
