in built-in object IO managers, handle loading objects for multiple partitions


If a daily asset depends on an hourly asset, then each partition of the daily asset will correspond to 24 partitions of the hourly asset.

When materializing the daily asset and load_input is called to load the contents of the hourly asset, we should return the contents of those 24 partitions.
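To make the 24-to-1 correspondence concrete, here is a minimal standard-library sketch that lists the hourly partition keys covered by one daily partition. The key formats (`%Y-%m-%d` for daily, `%Y-%m-%d-%H:%M` for hourly) and the helper name `hourly_keys_for_day` are assumptions for illustration, not something stated in the issue.

```python
from datetime import datetime, timedelta


def hourly_keys_for_day(daily_key: str) -> list[str]:
    """Return the 24 hourly partition keys that fall inside one daily partition.

    Hypothetical helper: the key formats used here are assumed, not taken from the issue.
    """
    day_start = datetime.strptime(daily_key, "%Y-%m-%d")
    return [
        (day_start + timedelta(hours=hour)).strftime("%Y-%m-%d-%H:%M")
        for hour in range(24)
    ]


keys = hourly_keys_for_day("2022-01-01")
```

These are the keys for which load_input would need to return contents when the daily partition is materialized.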

For the Pandas type handler of the Snowflake IO manager, we handle this by returning a single dataframe that contains the concatenated contents of all the hourly partitions.

For the built-in object store IO managers, we could just return a list of the pickled objects. Or potentially a dictionary keyed by partition?
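The two return shapes can be compared with a small standard-library sketch. It is not the Dagster implementation: `store_partition`, `load_partitions`, and the on-disk layout are hypothetical names chosen for illustration. It pickles one object per partition and loads them back as a dictionary keyed by partition, from which a plain list is trivially derived.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def _partition_path(base_dir: Path, asset_name: str, partition_key: str) -> Path:
    # Hypothetical layout: one pickle file per partition; ":" is replaced
    # so the key is a valid file name on every platform.
    return base_dir / asset_name / (partition_key.replace(":", "_") + ".pkl")


def store_partition(base_dir: Path, asset_name: str, partition_key: str, obj: Any) -> None:
    path = _partition_path(base_dir, asset_name, partition_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def load_partitions(base_dir: Path, asset_name: str, partition_keys: list[str]) -> dict[str, Any]:
    # Dictionary keyed by partition; insertion order follows partition_keys.
    loaded = {}
    for key in partition_keys:
        with _partition_path(base_dir, asset_name, key).open("rb") as f:
            loaded[key] = pickle.load(f)
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    store_partition(base, "hourly_asset", "2022-01-01-00:00", [1, 2])
    store_partition(base, "hourly_asset", "2022-01-01-01:00", [3])
    loaded = load_partitions(
        base, "hourly_asset", ["2022-01-01-00:00", "2022-01-01-01:00"]
    )
    as_list = list(loaded.values())  # the "list of objects" alternative
```

The dictionary keeps the partition key attached to each object, while the list discards it; a caller that needs to know which hour a value came from would presumably prefer the dictionary.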

What we’ve heard:

Here’s a start on the implementation:

from dagster import io_manager, IOManager, AssetKey


@io_manager
def my_io_manager():
    return MyIOManager()


class MyIOManager(IOManager):
    def handle_output(self, context, obj):
        ...

    def _load(self, asset_key: AssetKey):
        ...

    def _load_partition(self, asset_key: AssetKey, partition_key: str):
        ...

    def load_input(self, context):
        if context.has_asset_partitions:
            partition_key_range = context.asset_partition_key_range
            if partition_key_range.start == partition_key_range.end:
                # Exactly one upstream partition: return its object directly.
                return self._load_partition(context.asset_key, partition_key_range.start)
            else:
                # Several upstream partitions: return one loaded object per partition key.
                return [
                    self._load_partition(context.asset_key, partition_key)
                    for partition_key in context.asset_partition_keys
                ]
        else:
            # Unpartitioned upstream asset: load it as a whole.
            return self._load(context.asset_key)

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
sryza commented, Oct 21, 2022

@sryza would a generator make more sense?

Interesting - this isn’t a pattern that we’ve used before with IO managers, but I think it makes sense.
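A generator variant might look like the following sketch. It is plain Python rather than Dagster code, and `iter_partitions` and `fake_load` are hypothetical names. The point it illustrates is that each partition is loaded only when the consumer asks for it, so all 24 objects never need to be in memory at once.

```python
from typing import Any, Callable, Iterable, Iterator


def iter_partitions(
    partition_keys: Iterable[str],
    load_partition: Callable[[str], Any],
) -> Iterator[tuple[str, Any]]:
    """Yield (partition_key, object) pairs, loading each partition lazily."""
    for key in partition_keys:
        yield key, load_partition(key)


# Stand-in loader that records which partitions were actually read.
load_log: list[str] = []


def fake_load(key: str) -> str:
    load_log.append(key)
    return key.upper()


gen = iter_partitions(["a", "b", "c"], fake_load)
first = next(gen)           # only partition "a" has been loaded so far
log_after_first = list(load_log)
rest = list(gen)            # the remaining partitions load on demand
```

One trade-off of this design is that a generator can be consumed only once, so downstream code that needs several passes over the partitions would have to materialize it into a list or dictionary first.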

Testing it using materialize_to_memory is proving tricky since all assets need to share the same partitions_def.

You could do something like this:

materialize([*hourly_asset.to_source_assets(), daily_asset], partition_key="2022-01-01")

That will run load_input for daily_asset without trying to materialize hourly_asset in the same run.

0 reactions
sryza commented, Oct 24, 2022

I also think a syntax for being more specific about higher-frequency asset dependencies would be useful, but that is out of scope for this ticket. In some cases not all overlapping partitions are required; it would be nice to specify e.g. hourly_asset['-3H'] or something similar.

You might already be aware of this, but you could define a custom PartitionMapping to express this.
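As a rough illustration of what such a mapping would have to compute, the sketch below selects only the last three hourly keys of a daily partition. This is plain Python, not the Dagster PartitionMapping API; the helper name `trailing_hourly_keys`, the key formats, and the reading of '-3H' as "the final three hours of the day" are all assumptions.

```python
from datetime import datetime, timedelta


def trailing_hourly_keys(daily_key: str, hours: int = 3) -> list[str]:
    """Return the last `hours` hourly keys inside the given daily partition.

    Hypothetical helper; key formats and the meaning of '-3H' are assumed.
    """
    day_end = datetime.strptime(daily_key, "%Y-%m-%d") + timedelta(days=1)
    return [
        (day_end - timedelta(hours=offset)).strftime("%Y-%m-%d-%H:%M")
        for offset in range(hours, 0, -1)
    ]


trailing = trailing_hourly_keys("2022-01-01")
```

A custom mapping would encode this kind of key selection so that load_input receives only the narrowed set of upstream partitions.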
