Runtime asset partitions
I’ve seen a couple of cases recently where it would be helpful to have “runtime” asset partitions, i.e. a partition is added to an asset when a job runs, rather than at definition time.
What we’ve heard
How can I handle a partitioned data asset in Dagster when there is no clear notion (e.g. a date) of when an update occurs? For example, let’s assume some external (public) dataset like https://data.world/buffalony/ru4s-wz29 or https://data.buffalony.gov/Quality-of-Life/Recyclable-Materials/ru4s-wz29 should be polled for changes (once every day/week/whatever), and if fresh data is available, the data is downloaded and stored in the data warehouse as a new partition of this specific asset. I.e. how can I harmonize the notion of a partitioned asset with a daily trigger to check for updates, given an asset that might update only every now and then?
- Daniel Michaelis - https://dagster.slack.com/archives/C01U954MEER/p1652275233509229
- David Elston - https://dagster.slack.com/archives/C01U5LFUZJS/p1657569889121039?thread_ts=1657562095.567059&cid=C01U5LFUZJS
The use case I’m looking to support is partitioning my assets by our customers (as we manage many assets in a single-tenant manner); I would imagine any user wanting to manage software-defined assets in a single-tenant manner would stumble on the same blocker. I know there is some conflict between the defined/deterministic nature of software-defined assets and dynamic partitioning, but I feel this is probably a relatively common use case.
We have discussed having a static config and generating the assets, but our customer base is not static (and in fact we partition again inside a particular customer).
We would effectively have to redeploy the entire repository any time we created a new customer, which in the long run is unsustainable.
Are there plans to support dynamically partitioned Assets in the future or a way to do the following with current features? In essence, I need to process experiment results and would like to use partitioned assets for that. Every few days (inconsistent) an experiment result will be made available for Dagster to process, and I would like to handle each result as a separate partition. The list of partitions would have to grow dynamically and be re-fetched before kicking off runs for missing partitions.
- Zhiyuan Ma - https://dagster.slack.com/archives/C01U954MEER/p1663100865538039?thread_ts=1659814177.446119&cid=C01U954MEER
I am facing a similar use case. We have an external service that generates datasets, and the Dagster pipeline picks up each dataset and runs it through a set of ops.
- tga
Hi, I am trying to pass input files (file keys, really) through a processing pipeline, generating the original and each intermediate step as an asset (with keys like file1_step1, file1_step2). I have an ingest job that takes the file key and applies the graph, and the steps are called correctly; I’m just not sure how to use assets in this case.
As far as I can tell, the docs treat a software-defined asset as a fixed thing, with a key given by the function name, not as a template for generating dynamic assets based on input.
that’s the part that kills it for me – if I come back later and materialize one of those virtual assets, it’s not possible (I think) to automatically rematerialize downstream assets that depend on it, because they’re not linked – so what’s the value of having assets in the first place then?
there are a few assets that it’s helpful for me to write custom logic for, but mostly it’s just the same logic over and over again on a different S3 prefix. not sure if I’m misunderstanding something fundamental but none of the docs seem to really cover it
the data I’m working with is sets of microscope images that are pretty diverse in their context (in terms of acquisition, subject, etc.) but for which there are also some standard algorithms that can be run over them. I’m not sure what the implications of trying to group them all together into a single “image” asset and then attempting to partition them appropriately would mean for us
Briefly, I have a pipeline for processing the files in a manifest, and the number of manifests will grow over time. Each manifest should be processed only once unless it’s updated. I’m wondering how to best handle the dynamic nature of my inputs (i.e. the file manifests).
Hello, I would appreciate it if someone could point me in the right direction here, as I am new to Dagster. I want to create a new asset (text file 2) from a source asset (text file 1). Text file 1 is stored in AWS S3, and I also want to store text file 2 in AWS S3. The files are named by a unique ID (e.g. id123456.txt), and new files show up in AWS S3 daily. The ID also shows up in a database before the file shows up in AWS S3. I would like to be able to create the new asset (text file 2) from all existing source assets (text file 1) and from any new source assets (text file 1) that show up each day. Could anyone describe how I should be thinking about this? In my head, I am thinking that I should start by using the IDs in the database table to define a partitioned asset?
Design / implementation
We could imagine two flavors of this:
- The partition is provided at the time the run or step is kicked off. This would mean that you’d still be able to “backfill” over these partitions, as well as include them in the ASSET_MATERIALIZATION_PLANNED events.
- The partition is determined by the contents of an op or IO manager.
The value of modeling these as partitions, vs., say, just config:
- You can see the history of materializations for a particular partition
- You can track partition-level lineage
- You can view the list of all the partitions
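The bookkeeping implied by these three bullets can be sketched framework-agnostically. The following is a minimal toy model, not a Dagster API; all names in it are hypothetical. Partition keys are not declared up front — they are registered the first time a run materializes them, after which per-partition history and the full key list can be queried:

```python
from collections import defaultdict
from datetime import datetime, timezone


class RuntimePartitionRegistry:
    """Toy model of runtime asset partitions (hypothetical, not a Dagster API)."""

    def __init__(self):
        # partition key -> list of materialization records
        self._history = defaultdict(list)

    def record_materialization(self, partition_key: str, metadata: dict):
        # Materializing an unseen key implicitly creates the partition at run time.
        self._history[partition_key].append(
            {"at": datetime.now(timezone.utc), **metadata}
        )

    def partition_keys(self):
        # "View the list of all the partitions"
        return sorted(self._history)

    def history(self, partition_key: str):
        # "See the history of materializations for a particular partition"
        return list(self._history[partition_key])


registry = RuntimePartitionRegistry()
registry.record_materialization("customer_a", {"rows": 10})
registry.record_materialization("customer_b", {"rows": 7})
registry.record_materialization("customer_a", {"rows": 12})

assert registry.partition_keys() == ["customer_a", "customer_b"]
assert len(registry.history("customer_a")) == 2
```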
You could imagine an interaction between this and dynamic mapping - where each mapped step corresponds to one of the dynamic asset partitions.
We might want a special PartitionsDefinition for this. E.g.:

```python
@asset(partitions_def=RuntimePartitionsDefinition())
def my_asset():
    ...
```
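To connect this back to the polling use cases above, the first flavor (partition provided when the run is kicked off) could be driven by a sensor-style loop that discovers new partition keys from an external source and requests one run per unseen key. The sketch below is plain Python with hypothetical names (`poll_source`, `launch_run`), not a Dagster API:

```python
def poll_source():
    """Stand-in for polling an external system (an S3 listing, a database query, ...)."""
    return ["exp_001", "exp_002", "exp_003"]


def runtime_partition_sensor(known_partitions, launch_run):
    """Request one run per partition key we have not seen before.

    `known_partitions` is mutated in place; `launch_run` stands in for
    kicking off a run for that partition (hypothetical, not a Dagster API).
    """
    requested = []
    for key in poll_source():
        if key not in known_partitions:
            known_partitions.add(key)
            launch_run(partition_key=key)
            requested.append(key)
    return requested


known = {"exp_001"}
launched = []
new_keys = runtime_partition_sensor(
    known, lambda partition_key: launched.append(partition_key)
)
assert new_keys == ["exp_002", "exp_003"]
assert launched == ["exp_002", "exp_003"]
```

Because the sensor only requests runs for unseen keys, re-polling the same source is idempotent, which matches the "each manifest should be processed only once unless it's updated" use case above.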
Issue Analytics
- State:
- Created a year ago
- Reactions: 20
- Comments: 6 (2 by maintainers)
Top GitHub Comments
To add another use case, from a question I asked on Slack:
An alternative for my particular use case would be the ability to chain multiple partition defs together such that all combinations are produced.
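Chaining partition definitions so that all combinations are produced amounts to a cartesian product of the key spaces. A minimal illustration in plain Python (the key lists and separator are made up for the example):

```python
from itertools import product

# Two hypothetical partition dimensions to be chained.
daily_keys = ["2023-01-01", "2023-01-02"]
customer_keys = ["customer_a", "customer_b"]

# Each combined key pairs one key from each chained partition definition.
combined = [f"{d}|{c}" for d, c in product(daily_keys, customer_keys)]
assert combined == [
    "2023-01-01|customer_a",
    "2023-01-01|customer_b",
    "2023-01-02|customer_a",
    "2023-01-02|customer_b",
]
```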