[KED-1272] Video and image datasets being read and written to/from memory
Description
I am working with video and image datasets. As such, I’m either working with images or with sequences of images as raw data.
In looking through the documentation and the currently available Dataset objects, the _load and _save functions seem to want to read and write entire datasets to/from memory. For images, and especially for videos, this is not feasible. In large NLP or audio contexts it also may not work, though those may be more straightforward to partition.
One option, of course, is to pass data paths between nodes instead of loading the raw images. However, this breaks the principle of “pure functions”, as it incurs a number of filesystem changes that the user must keep careful track of and that are not handled by Kedro’s API.
In summary, it’s unclear how to move forward with very high-dimensional data. I’m looking for general proposals on how to proceed, maybe some input from an experienced Kedro user or proposals for framework additions.
Possible Implementations
- A basic recipe or notebook example for implementing such a dataset/dag
- A graph computation solution which involves batching datasets, e.g. some iterload function for datasets, and dag expansion over a dynamically sized iterable, then merging of saved results. A sort of map-reduce on the dataset.
- A safe IO class: a file-type class that takes an “input” and an “output” and handles reading and saving the data itself.
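To make the second proposal concrete, here is a minimal sketch of the batching idea in plain Python. It is not Kedro functionality: `iter_batches` and `map_reduce_batches` are hypothetical names, and the `load`, `process`, and `merge` callables stand in for whatever a real pipeline would do per image or frame. Only one batch is held in memory at a time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")  # a source handle, e.g. a file path
R = TypeVar("R")  # a loaded item, e.g. one image or frame
S = TypeVar("S")  # a per-batch (and final) result


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def map_reduce_batches(
    sources: Iterable[T],
    load: Callable[[T], R],
    process: Callable[[List[R]], S],
    merge: Callable[[List[S]], S],
    batch_size: int,
) -> S:
    """Load and process one batch at a time, then merge the partial results."""
    partials: List[S] = []
    for batch in iter_batches(sources, batch_size):
        loaded = [load(source) for source in batch]  # only this batch is in memory
        partials.append(process(loaded))
    return merge(partials)


# Toy usage: each "frame" is a list of three pixel values derived from its index.
total = map_reduce_batches(
    sources=range(5),
    load=lambda i: [i] * 3,
    process=lambda frames: sum(sum(frame) for frame in frames),
    merge=sum,
    batch_size=2,
)
print(total)  # 30
```

In a real pipeline the partial results would more likely be written to disk per batch and merged afterwards, which is the "merging of saved results" step mentioned above.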
Top GitHub Comments
Thank you for opening an issue! This is not a perfect solution (and it slightly breaks the “pure function” principle, as you mentioned), but you could open a file in _load and handle the partitioning of the file object inside the node function. Another solution I can think of is to create a custom dataset using a library like Dask (example: https://github.com/quantumblacklabs/kedro/pull/97) to enable loading in chunks.
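As a rough illustration of the first suggestion, the sketch below returns an open file handle from _load and leaves the chunking to the node. It uses only the standard library: `ChunkedFileDataSet` and `read_chunks` are hypothetical names, and the class only mimics the _load/_save shape rather than subclassing Kedro's dataset base class, which a real custom dataset would be expected to do.

```python
import tempfile
from pathlib import Path
from typing import BinaryIO, Iterator, Union


class ChunkedFileDataSet:
    """Hypothetical stand-in for a custom dataset; not a Kedro class."""

    def __init__(self, filepath: Union[str, Path]) -> None:
        self._filepath = Path(filepath)

    def _load(self) -> BinaryIO:
        # Return an open handle instead of the file contents,
        # so nothing is read into memory at load time.
        return open(self._filepath, "rb")

    def _save(self, data: bytes) -> None:
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_bytes(data)


def read_chunks(handle: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Node-side helper: yield fixed-size chunks and close the handle when done."""
    with handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Toy usage: save ten bytes, then read them back four bytes at a time.
with tempfile.TemporaryDirectory() as tmp:
    dataset = ChunkedFileDataSet(Path(tmp) / "frames.bin")
    dataset._save(b"0123456789")
    chunk_sizes = [len(chunk) for chunk in read_chunks(dataset._load(), chunk_size=4)]

print(chunk_sizes)  # [4, 4, 2]
```

The trade-off is the one noted in the comment: the node now performs I/O itself, so it is no longer a pure function of in-memory inputs.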
Hi @dasturge! We actually released an ImageDataSet that uses pillow recently. You may want to check that out. I’ll close this issue for now; let us know if you need anything else.