in built-in object IO managers, handle loading objects for multiple partitions

If a daily asset depends on an hourly asset, then each partition of the daily asset will correspond to 24 partitions of the hourly asset.
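To make the 24-to-1 relationship concrete, here is a minimal sketch in plain Python, independent of Dagster. The helper name `hourly_keys_for_day` and the two key formats are assumptions for illustration; the formats should be adjusted to whatever the partitions definitions in use actually produce.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed key formats (not taken from the issue); adjust to your partitions definitions.
DAILY_FMT = "%Y-%m-%d"
HOURLY_FMT = "%Y-%m-%d-%H:%M"


def hourly_keys_for_day(daily_key: str) -> List[str]:
    """Return the 24 hourly partition keys covered by one daily partition key."""
    day_start = datetime.strptime(daily_key, DAILY_FMT)
    return [(day_start + timedelta(hours=h)).strftime(HOURLY_FMT) for h in range(24)]


keys = hourly_keys_for_day("2022-01-01")
```

Under these assumptions, `keys` runs from the midnight partition of 2022-01-01 through the 23:00 partition of the same day.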

When materializing the daily asset and load_input is called to load the contents of the hourly asset, we should return the contents of those 24 partitions.

For the Pandas type handler of the Snowflake IO manager, we handle this by returning a single dataframe that contains the concatenated contents of all the hourly partitions.

For the built-in object store IO managers, we could just return a list of the pickled objects. Or potentially a dictionary keyed by partition?
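The two return shapes can be compared with a small sketch. It uses a hypothetical in-memory dictionary in place of a real object store, and the helper names (`load_partition`, `load_as_list`, `load_as_dict`) are illustrative rather than part of any Dagster API.

```python
import pickle
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for an object store: (asset name, partition key) -> pickled bytes.
store: Dict[Tuple[str, str], bytes] = {
    ("hourly_asset", "2022-01-01-00:00"): pickle.dumps({"rows": 10}),
    ("hourly_asset", "2022-01-01-01:00"): pickle.dumps({"rows": 12}),
}


def load_partition(asset: str, partition_key: str) -> Any:
    """Unpickle the object stored for one partition."""
    return pickle.loads(store[(asset, partition_key)])


def load_as_list(asset: str, partition_keys: List[str]) -> List[Any]:
    # Order follows partition_keys, but the keys themselves are not returned.
    return [load_partition(asset, key) for key in partition_keys]


def load_as_dict(asset: str, partition_keys: List[str]) -> Dict[str, Any]:
    # Each object stays associated with the partition it came from.
    return {key: load_partition(asset, key) for key in partition_keys}


keys = ["2022-01-01-00:00", "2022-01-01-01:00"]
as_list = load_as_list("hourly_asset", keys)
as_dict = load_as_dict("hourly_asset", keys)
```

The list is simpler for downstream code that only aggregates, while the dictionary lets downstream code tell which partition each object belongs to.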

What we’ve heard:

2022-10-20: https://dagster.slack.com/archives/C01U954MEER/p1666175125167899

Here’s a start on the implementation:

from dagster import io_manager, IOManager, AssetKey


@io_manager
def my_io_manager():
    return MyIOManager()


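class MyIOManager(IOManager):
    def handle_output(self, context, obj):
        # Store obj for the asset (and partition, if any) described by context.
        ...

    def _load(self, asset_key: AssetKey):
        # Load the stored object for an unpartitioned asset.
        ...

    def _load_partition(self, asset_key: AssetKey, partition_key: str):
        # Load the stored object for a single partition of an asset.
        ...

    def load_input(self, context):
        if context.has_asset_partitions:
            partition_key_range = context.asset_partition_key_range
            if partition_key_range.start == partition_key_range.end:
                # Exactly one upstream partition: return its object directly.
                return self._load_partition(context.asset_key, partition_key_range.start)
            else:
                # Several upstream partitions (e.g. 24 hourly partitions for one
                # daily partition): return one loaded object per partition key.
                return [
                    self._load_partition(context.asset_key, partition_key)
                    for partition_key in context.asset_partition_keys
                ]
        else:
            # Unpartitioned upstream asset.
            return self._load(context.asset_key)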

Top GitHub Comments

sryza commented, Oct 21, 2022

> @sryza would a generator make more sense?

Interesting - this isn’t a pattern that we’ve used before with IO managers, but I think it makes sense.
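As a rough illustration of the generator idea, the sketch below yields one loaded object per partition instead of building the whole list up front. It is plain Python rather than an IO manager: `iter_partitions` and `fake_load` are hypothetical names, and `fake_load` merely records which keys have been requested so the laziness is visible.

```python
from typing import Any, Callable, Iterable, Iterator, List


def iter_partitions(
    load_partition: Callable[[str], Any], partition_keys: Iterable[str]
) -> Iterator[Any]:
    """Lazily yield the object for each partition key, one at a time."""
    for key in partition_keys:
        yield load_partition(key)


loaded: List[str] = []  # records which partitions have actually been loaded


def fake_load(key: str) -> str:
    # Hypothetical loader standing in for a real per-partition read.
    loaded.append(key)
    return key.upper()


gen = iter_partitions(fake_load, ["a", "b", "c"])
first = next(gen)  # only the first partition has been loaded at this point
```

The trade-off is that a generator can be consumed only once and has no length, so downstream code that needs random access would still have to collect it into a list.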

> Testing it using materialize_to_memory is proving tricky since all assets need to share the same partitions_def.

You could do something like this:

materialize([*hourly_asset.to_source_assets(), daily_asset], partition_key="2022-01-01")

That will run load_input for daily_asset without trying to materialize hourly_asset in the same run.

sryza commented, Oct 24, 2022

> I also think a syntax for being more specific about higher-frequency asset dependencies would be useful, but that is out of scope for this ticket. In some cases not all overlapping partitions are required; it would be nice to specify e.g. hourly_asset['-3H'] or something similar.

You might already be aware of this, but you could define a custom PartitionMapping to express this.
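The selection logic such a mapping would encode can be sketched in plain Python. This is not Dagster's PartitionMapping interface; it only shows a downstream daily key being mapped to a subset of upstream hourly keys. Reading '-3H' as "the last three hours of the day" is an assumption, as are the helper name `last_n_hourly_keys` and the key formats.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed key formats (not taken from the issue).
DAILY_FMT = "%Y-%m-%d"
HOURLY_FMT = "%Y-%m-%d-%H:%M"


def last_n_hourly_keys(daily_key: str, n_hours: int) -> List[str]:
    """Return the keys of the final n_hours hourly partitions within the given day."""
    day_end = datetime.strptime(daily_key, DAILY_FMT) + timedelta(days=1)
    # Count back from the end of the day so the keys come out in chronological order.
    return [
        (day_end - timedelta(hours=h)).strftime(HOURLY_FMT)
        for h in range(n_hours, 0, -1)
    ]


subset = last_n_hourly_keys("2022-01-01", 3)
```

With a mapping like this in place, load_input would receive only the selected partition keys rather than all 24.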
