
[KED-1272] Video and image datasets being read and written to/from memory


Description

I am working with video and image datasets. As such, I’m either working with images or with sequences of images as raw data.

In looking through the documentation and the currently available Dataset objects, the _load and _save functions appear to read and write entire datasets to/from memory. For images, and especially for videos, this is not feasible. It may also not work in large NLP or audio contexts, though those may be more straightforward to partition.

One option, of course, is to handle data paths instead of loading the raw images. However, this breaks the principle of “pure functions”, as it incurs a number of filesystem changes that the user must keep careful track of and that are not handled by Kedro’s API.
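For concreteness, the path-based approach the paragraph above warns about might look like the sketch below. `process_video` is a hypothetical node (not part of Kedro) that receives paths rather than data, so the write it performs is a side effect that Kedro's catalog cannot track or version. A text file stands in for a video so the example is self-contained.

```python
import os
import tempfile

def process_video(in_path, out_path):
    """Hypothetical 'impure' node: it receives paths, not data, and
    performs its own filesystem I/O, invisible to Kedro's catalog."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:            # pretend each line is a frame
            dst.write(line.upper()) # pretend this is frame processing
    return out_path                 # downstream nodes get a path, not data

# self-contained demo with a fake two-frame "video"
tmp = tempfile.mkdtemp()
in_path = os.path.join(tmp, "in.txt")
out_path = os.path.join(tmp, "out.txt")
with open(in_path, "w") as f:
    f.write("frame a\nframe b\n")

result = process_video(in_path, out_path)
with open(result) as f:
    content = f.read()
assert content == "FRAME A\nFRAME B\n"
```

The upside is that memory use stays bounded; the downside, as noted above, is that the node is no longer a pure function of its inputs.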

In summary, it’s unclear how to move forward with very high-dimensional data. I’m looking for general proposals on how to proceed: input from an experienced Kedro user, or suggestions for framework additions.

Possible Implementations

  1. A basic recipe or notebook example showing how to implement such a dataset/DAG.
  2. A graph-computation solution that batches datasets, e.g. some iterload function for datasets, plus DAG expansion over a dynamically sized iterable and merging of the saved results: a sort of map-reduce on the dataset.
  3. A safe I/O class: a filetype class that takes an “input” and an “output” and handles both reading and saving the data.
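Outside of any Kedro API, option 2 could be sketched as a generator that yields fixed-size batches plus a reduce step that merges per-batch results. The names `iter_batches` and `fake_frames` are illustrative only; integers stand in for decoded video frames.

```python
from itertools import islice

def iter_batches(frames, batch_size):
    """Yield lists of at most batch_size frames, so only one batch
    is ever resident in memory (a map-reduce style 'iterload')."""
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def fake_frames(n):
    """Stand-in for decoding a video stream: one int per 'frame'."""
    for i in range(n):
        yield i

# map: compute a per-batch statistic; reduce: merge the partial results
partials = [sum(batch) for batch in iter_batches(fake_frames(10), batch_size=4)]
total = sum(partials)
assert total == sum(range(10))  # same answer as loading everything at once
```

In a real pipeline the map step would run one node invocation per batch and the reduce step would merge the saved partial outputs, which is the "DAG expansion over a dynamically sized iterable" described above.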

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
921kiyo commented, Oct 16, 2019

Thank you for opening an issue! This is not a perfect solution (and it slightly breaks the “pure function” principle, as you mentioned), but you could open a file in _load and handle the partitioning of the file object inside the node function.
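The suggestion above (a `_load` that returns a lazy handle, with partitioning done in the node) might look roughly like this. `LazyFramesDataSet` is a hypothetical stand-in for a Kedro `AbstractDataSet` subclass, and a text file plays the role of a frame store so the example is self-contained; a real implementation would subclass Kedro's dataset base class and decode actual video frames.

```python
import os
import tempfile

class LazyFramesDataSet:
    """Hypothetical stand-in for a Kedro AbstractDataSet subclass:
    _load returns a lazy iterator over 'frames' instead of reading
    them all into memory at once."""

    def __init__(self, filepath):
        self._filepath = filepath

    def _load(self):
        def frames():
            with open(self._filepath) as f:
                for line in f:          # one 'frame' per line
                    yield line.strip()
        return frames()

# write a fake three-frame "video"
path = os.path.join(tempfile.mkdtemp(), "video.txt")
with open(path, "w") as f:
    f.write("frame0\nframe1\nframe2\n")

# the node function consumes the handle one frame at a time
dataset = LazyFramesDataSet(path)
processed = [frame.upper() for frame in dataset._load()]
assert processed == ["FRAME0", "FRAME1", "FRAME2"]
```

The trade-off is exactly the one noted in the comment: the node now depends on the shape of the loaded object rather than on plain in-memory data, which bends the pure-function convention.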

Another solution I can think of is to create a custom dataset using a library like Dask (see https://github.com/quantumblacklabs/kedro/pull/97 for an example) to enable loading in chunks.

0 reactions
yetudada commented, May 26, 2020

Hi @dasturge! We actually released an ImageDataSet that uses Pillow recently. You may want to check that out. I’ll close this issue for now; let us know if you need anything else.
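For reference, a catalog entry for that dataset might look roughly like the fragment below. This is a sketch: the exact type path (assumed here to be `pillow.ImageDataSet`) and the filepath are illustrative, and how the type is spelled depends on your Kedro version, so check the docs for the release you are using.

```yaml
# conf/base/catalog.yml (illustrative fragment)
raw_frame:
  type: pillow.ImageDataSet        # assumed type path; varies by Kedro version
  filepath: data/01_raw/frame.png
```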
