
Runtime asset partitions


I’ve seen a couple of cases recently where it would be helpful to have “runtime” asset partitions, i.e. a partition is added to an asset when a job runs, rather than at definition time.

What we’ve heard

How can I handle a data asset in Dagster which is partitioned, but where there is no clear notion (i.e. a date) of when an update occurs? For example, let’s assume some external (public) dataset like https://data.world/buffalony/ru4s-wz29 or https://data.buffalony.gov/Quality-of-Life/Recyclable-Materials/ru4s-wz29 should be polled for changes (once every day/week/whatever); if fresh data is available, it is downloaded and stored in the data warehouse as a new partition of this specific asset. In other words, how can I harmonize the notion of a partitioned asset with a daily trigger to check for updates, given an asset that might only update every now and then?
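
A sketch of how the polling half can be handled today: a sensor that checks the source on an interval and only requests a run when something new appears. The store_recyclables_job name and the check_for_new_snapshot helper are assumptions for illustration; the partition half is what this issue is about.

from dagster import RunRequest, SkipReason, sensor

# Hypothetical helper: returns an identifier for the newest upstream
# snapshot, or None if nothing has changed since `last_seen`.
def check_for_new_snapshot(last_seen):
    ...

@sensor(job=store_recyclables_job, minimum_interval_seconds=24 * 60 * 60)  # hypothetical job
def recyclables_sensor(context):
    fresh = check_for_new_snapshot(context.cursor)
    if fresh is None:
        yield SkipReason("upstream dataset unchanged")
        return
    context.update_cursor(fresh)
    # run_key deduplicates: at most one run per upstream snapshot
    yield RunRequest(run_key=fresh)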

The use case I’m looking to support is to partition my assets by our customers (as we manage many assets in a single-tenant manner); I would imagine any user wanting to manage software-defined assets in a single-tenant manner would stumble on the same blocker. I know there is some conflict between the defined/deterministic nature of software-defined assets and dynamic partitioning, but I feel this is probably a relatively common use case.

We have discussed having a static config and generating the assets, but our customer base is not static (and in fact we partition again inside a particular customer).

We would effectively have to redeploy the entire repository anytime we created a new customer, which in the long run is unsustainable.

Are there plans to support dynamically partitioned assets in the future, or a way to do the following with current features? In essence, I need to process experiment results and would like to use partitioned assets for that. Every few days (at inconsistent intervals) an experiment result will be made available for Dagster to process, and I would like to handle each result as a separate partition. The list of partitions would have to grow dynamically and be re-fetched before kicking off runs for missing partitions.
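
For what it’s worth, the function-flavored DynamicPartitionsDefinition gets part of the way there today, since the partition list is recomputed whenever it is loaded; how well that composes with assets and backfills is exactly what this issue is asking about. A sketch, where list_experiment_ids and process are hypothetical helpers:

from dagster import DynamicPartitionsDefinition, asset

# Hypothetical helper: queries wherever results land and returns the
# experiment IDs currently available.
def list_experiment_ids(_current_time):
    ...

experiments = DynamicPartitionsDefinition(list_experiment_ids)

@asset(partitions_def=experiments)
def experiment_result(context):
    # One materialization per experiment ID
    return process(context.partition_key)  # hypothetical processing step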

I am facing a similar use case. We have an external service that generates datasets, and the Dagster pipeline needs to pick up each dataset and run it through a set of ops.

  • tga

Hi, I am trying to pass input files (file keys, really) through a processing pipeline, generating the original and each intermediate step as an asset (with keys like file1_step1, file1_step2). I have an ingest job that takes the file key and applies the graph, and the steps are called correctly; I’m just not sure how to use assets in this case.

As far as I can tell, the docs treat software-defined assets as a fixed thing, with a key given by the function name – not a template for generating dynamic assets based on input.

That’s the part that kills it for me – if I come back later and materialize one of those virtual assets, it’s not possible (I think) to automatically rematerialize downstream assets that depend on it, because they’re not linked. So what’s the value of having assets in the first place?

There are a few assets that it’s helpful for me to write custom logic for, but mostly it’s just the same logic over and over again on a different S3 prefix. Not sure if I’m misunderstanding something fundamental, but none of the docs seem to really cover it.

The data I’m working with is sets of microscope images that are pretty diverse in their context (in terms of acquisition, subject, etc.) but for which there are also some standard algorithms that can be run over them. I’m not sure what the implications would be for us of trying to group them all together into a single “image” asset and then attempting to partition it appropriately.
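
One stopgap for the “same logic over and over on a different S3 prefix” shape above is an asset factory: a plain Python loop that stamps out one asset definition per known key. A sketch; FILE_KEYS and run_step1 are assumptions, and the key list is still fixed at definition time, which is the limitation this issue tracks.

from dagster import asset

FILE_KEYS = ["file1", "file2"]  # assumed known when definitions load

def make_step1_asset(file_key):
    @asset(name=f"{file_key}_step1")
    def _step1():
        return run_step1(file_key)  # hypothetical per-file logic
    return _step1

step1_assets = [make_step1_asset(key) for key in FILE_KEYS]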

Briefly, I have a pipeline for processing the files in a manifest, and the number of manifests will grow over time. Each manifest should be processed only once unless it’s updated. I’m wondering how to best handle the dynamic nature of my inputs (i.e. the file manifests).

Hello, I would appreciate it if someone could point me in the right direction here, as I am new to Dagster. I want to create a new asset (text file 2) from a source asset (text file 1). Text file 1 is stored in AWS S3, and I also want to store text file 2 in AWS S3. The files are named by a unique ID (e.g. id123456.txt), and new files show up in S3 daily. The ID also shows up in a database before the file appears in S3. I would like to be able to create the new asset (text file 2) from all existing source assets (text file 1) and from any new source assets that show up each day. Could anyone describe how I should be thinking about this? In my head, I should start by using the IDs in the database table to define a partitioned asset?
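
Starting from the IDs table is a reasonable instinct. A sketch of that wiring, with fetch_ids_from_db, transform, and the bucket name as assumptions; the caveat is that the ID list is only refreshed when the partitions definition is reloaded:

import boto3
from dagster import DynamicPartitionsDefinition, asset

file_ids = DynamicPartitionsDefinition(
    lambda _now: fetch_ids_from_db()  # hypothetical query returning ["id123456", ...]
)

@asset(partitions_def=file_ids)
def text_file_2(context):
    s3 = boto3.client("s3")
    source = s3.get_object(Bucket="my-bucket", Key=f"{context.partition_key}.txt")
    body = transform(source["Body"].read())  # hypothetical transformation
    s3.put_object(Bucket="my-bucket", Key=f"{context.partition_key}_2.txt", Body=body)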

Design / implementation

Two flavors of this we could imagine:

  • The partition is provided at the time the run or step is kicked off. This would mean that you’d still be able to “backfill” over these partitions, as well as include them in ASSET_MATERIALIZATION_PLANNED events (see the sketch after this list).
  • The partition is determined by the contents of an op or IO manager.
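
For the first flavor, the kickoff could look like today’s sensors with the partition key chosen per request. A sketch, assuming RunRequest accepts a partition_key and that my_asset_job (hypothetical) targets a runtime-partitioned asset:

from dagster import RunRequest, sensor

@sensor(job=my_asset_job)  # hypothetical job wrapping a runtime-partitioned asset
def new_customer_sensor(_context):
    for customer_id in new_customer_ids():  # hypothetical lookup
        # The partition key is supplied at kickoff, not at definition time
        yield RunRequest(run_key=customer_id, partition_key=customer_id)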

The value of modeling these as partitions, vs., say, just config:

  • You can see the history of materializations for a particular partition
  • You can track partition-level lineage
  • You can view the list of all the partitions

You could imagine an interaction between this and dynamic mapping - where each mapped step corresponds to one of the dynamic asset partitions.

We might want a special PartitionsDefinition for this. E.g.

from dagster import asset

# RuntimePartitionsDefinition is the proposed API, not an existing class
@asset(partitions_def=RuntimePartitionsDefinition())
def my_asset():
    ...

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 20
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
nickvazz commented, Nov 14, 2022

To add another use case, from a question I asked on Slack:

Is it possible to add a partition when a job is created? If I had a job with a dynamic partitioning scheme based on the bucket sub-directories that exist, and the job can/will create one of those sub-directories on its first run, can the job, when it runs, add an additional key to that dynamic partition set?
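
With the function-flavored DynamicPartitionsDefinition, the new key would show up the next time the partitions function is evaluated rather than being pushed by the run itself; whether a run can append a key directly is the open question. A sketch of the listing side, with the bucket name assumed:

import boto3
from dagster import DynamicPartitionsDefinition

def list_subdirectories(_current_time):
    # Each top-level sub-directory of the bucket becomes a partition key
    response = boto3.client("s3").list_objects_v2(Bucket="my-bucket", Delimiter="/")
    return [p["Prefix"].rstrip("/") for p in response.get("CommonPrefixes", [])]

subdirectories = DynamicPartitionsDefinition(list_subdirectories)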

2 reactions
kevinjdolan commented, Sep 16, 2022

An alternative for my particular use case would be the ability to chain multiple partition definitions together such that all combinations are produced.
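
The all-combinations case maps onto a multi-dimensional partitions definition. A sketch, assuming Dagster’s (experimental) MultiPartitionsDefinition with one static and one daily dimension; the static dimension is still fixed at definition time, which is why runtime partitions would matter even here:

from dagster import (
    DailyPartitionsDefinition,
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    asset,
)

customer_by_date = MultiPartitionsDefinition(
    {
        "customer": StaticPartitionsDefinition(["acme", "globex"]),  # assumed customer list
        "date": DailyPartitionsDefinition(start_date="2022-01-01"),
    }
)

@asset(partitions_def=customer_by_date)
def customer_report():
    ...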
