Allow new attributes to be added to DataSets
Description
I need to track certain attributes on my datasets and have been creating custom DataSets to get around this issue. Now that hooks are out, most of my reasons for custom DataSets are gone, and I can achieve the same thing with an after_node_run hook, but I still cannot attach custom attributes to datasets.
Use Case 1 (can I share this dataset)
I would like to attach things like confidentiality to the dataset so that team members can easily know who they can share a dataset with by looking at an attribute on the dataset. Ideally, I would like to add these to the catalog.
Use Case 2 (can I delete this sub_pipeline)
I would also like to check pipeline health in CI. One thing I would like to look for is dangling edges that serve no purpose. Sometimes during refactoring we switch to a new section of the pipeline; the old one gets disconnected but never removed, and later we wonder whether anyone is using that output. It would be nice to have CI tell us that we need to either mark that dataset as a final output or remove that section of the pipeline.
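A minimal sketch of what such a CI check could look like, assuming each node is reduced to a pair of input and output dataset names and that each dataset carries an `is_output` flag like the one proposed in this issue. The function name `find_dangling_outputs` and the plain-dict data shapes are hypothetical stand-ins, not part of Kedro:

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical simplification: a node is (input names, output names).
Node = Tuple[Sequence[str], Sequence[str]]


def find_dangling_outputs(
    nodes: Sequence[Node], attributes: Dict[str, Dict[str, Any]]
) -> List[str]:
    """Return datasets that are produced, never consumed, and not marked is_output."""
    consumed = {name for inputs, _ in nodes for name in inputs}
    produced = {name for _, outputs in nodes for name in outputs}
    return sorted(
        name
        for name in produced - consumed
        if not attributes.get(name, {}).get("is_output", False)
    )


# Example pipeline: "old_features" was left behind by a refactor.
nodes = [
    (["cars"], ["clean_cars"]),
    (["clean_cars"], ["report"]),
    (["cars"], ["old_features"]),
]
# Assumed per-dataset attributes, as they might be read from the catalog.
attributes = {"report": {"is_output": True}}

dangling = find_dangling_outputs(nodes, attributes)
```

In CI, a non-empty `dangling` list could fail the build, which forces a decision: mark the dataset as a final output or delete the unused section.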
Possible Implementation
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  attributes: # 👈 this is the proposed feature, not currently in the framework
    is_output: true
    confidentiality: public
AbstractDataSet subclasses would need to accept the attributes keyword and attach the attributes to each instance.
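To illustrate the idea, here is a rough sketch that does not use Kedro itself. `AttributedDataSet` and `dataset_from_config` are hypothetical names standing in for a dataset class and the catalog's config loading; the real framework classes would need equivalent changes:

```python
from typing import Any, Dict, Optional


class AttributedDataSet:
    """Hypothetical stand-in for a dataset class; not Kedro's actual API."""

    def __init__(
        self, filepath: str, attributes: Optional[Dict[str, Any]] = None
    ) -> None:
        self._filepath = filepath
        # Copy so that later changes to the config dict do not leak in.
        self.attributes: Dict[str, Any] = dict(attributes or {})


def dataset_from_config(config: Dict[str, Any]) -> AttributedDataSet:
    """Build a dataset from a catalog-style entry, passing attributes through."""
    kwargs = dict(config)
    kwargs.pop("type", None)  # type resolution is out of scope for this sketch
    return AttributedDataSet(**kwargs)


# The catalog entry from the proposal, written as the dict it would parse to.
cars = dataset_from_config(
    {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/company/cars.csv",
        "attributes": {"is_output": True, "confidentiality": "public"},
    }
)
```

With something like this in place, a hook or a team member could read `cars.attributes["confidentiality"]` directly from the instance, and an entry without an `attributes` key would simply get an empty dict.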
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 4
- Comments: 10 (7 by maintainers)
Still dreaming of being able to add additional attributes to datasets so that I can access them in hooks. Is this something the Kedro team is interested in allowing?
@mzjp2 Similar to https://github.com/quantumblacklabs/kedro/issues/324