Custom generator for training on out-of-memory datasets
In https://bering-ivis.readthedocs.io/en/latest/oom_datasets.html, the out-of-memory datasets documentation says to train on h5 files that exist on disk.
In my case I can’t use h5 files, but I could use a custom generator that yields batched numpy arrays.
Is there a way to provide batched data through a custom generator function, something like keras’ `fit_generator`?
Thank you
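For concreteness, here is a minimal sketch of the kind of generator meant here; the `shard_paths` argument, the `.npy` file layout, and the batch size are made-up details for illustration:

```python
import numpy as np

def batch_generator(shard_paths, batch_size=256):
    """Hypothetical generator yielding batched numpy arrays,
    in the style that keras' fit_generator used to consume."""
    while True:  # keras-style generators loop indefinitely
        for path in shard_paths:
            shard = np.load(path)  # load one on-disk chunk at a time
            for start in range(0, len(shard), batch_size):
                yield shard[start:start + batch_size]
```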
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (1 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, we have recently introduced initial support for training on arbitrary out-of-memory datasets using ivis by formalizing the interface that input data must conform to.
Ivis will accept `ivis.data.sequence.IndexableDataset` instances in its `fit`, `transform` and `fit_transform` methods. An IndexableDataset inherits from `collections.abc.Sequence` and defines one new method, `shape`, that takes no arguments and returns the expected shape of the dataset (for example, [rows, columns]).
The `collections.abc.Sequence` class requires `__len__` (returns the number of rows) and `__getitem__` (returns the data at a row index) to be implemented. When implementing the `__getitem__` method we can customize how the data is retrieved to behave in any way desired.
As an example, we have provided an `ivis.data.sequence.ImageDataset` class for reference, which reads image files from disk into memory when indexed.
This is still quite a new feature and we may enhance it based on the feedback we get, so any thoughts on your experience with this would be valued if you end up trying it. We also want to, in time, expand the classes we provide to cover some common use-cases. This is particularly relevant since tensorflow==2.4.0 has deprecated support for the HDF5Matrix class, making that method of out-of-memory training impossible on newer versions of TensorFlow.
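As a rough illustration of that interface, here is a minimal sketch of a custom dataset backed by `.npy` shards on disk. The class name, file layout and constructor arguments are invented for the example, so check the actual base-class definition before relying on details such as whether `shape` is a plain method:

```python
import numpy as np
from ivis.data.sequence import IndexableDataset

class NpyShardDataset(IndexableDataset):
    """Hypothetical dataset serving rows from .npy shards on disk.

    Assumes every shard except possibly the last holds exactly
    `rows_per_shard` rows; only the shard containing the requested
    row is touched on each access.
    """

    def __init__(self, shard_paths, rows_per_shard, n_cols):
        self.shard_paths = list(shard_paths)
        self.rows_per_shard = rows_per_shard
        self.n_cols = n_cols
        # Count the final (possibly short) shard via a memory-map,
        # so nothing large is read eagerly.
        last = np.load(self.shard_paths[-1], mmap_mode="r")
        self._len = rows_per_shard * (len(self.shard_paths) - 1) + last.shape[0]

    def __len__(self):
        # Required by collections.abc.Sequence: number of rows.
        return self._len

    def __getitem__(self, index):
        # Required by collections.abc.Sequence: data at a row index.
        if not 0 <= index < self._len:
            raise IndexError(index)
        shard, row = divmod(index, self.rows_per_shard)
        data = np.load(self.shard_paths[shard], mmap_mode="r")
        return np.asarray(data[row])

    def shape(self):
        # The one new method: no arguments, returns [rows, columns].
        return [self._len, self.n_cols]
```

An instance of such a class could then, per the description above, be passed to ivis’ `fit`, `transform` or `fit_transform`.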
That’s right.
Each triplet is made up of three data points: 1) the anchor, 2) the positive example (one of the k-nearest neighbors), and 3) a negative example. The keras `fit` method only shuffles the anchors: when using the ‘batch’ shuffle mode, anchors are shuffled within a batch.
But each anchor data point then needs to be combined with a positive and a negative example in order to create a triplet, and these points may be in a completely different part of the data, outside the current batch of ‘anchors’.
For each anchor, we can retrieve the index of a positive example using the AnnoyIndex, but to actually retrieve the data at that index we need an indexable data structure (at least at the moment).
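To make that requirement concrete, here is a rough sketch of how a triplet might be assembled. This is not ivis’ actual implementation; `make_triplet` and its arguments are invented for the example, though `get_nns_by_item` is Annoy’s real lookup call:

```python
import random

def make_triplet(dataset, annoy_index, anchor_idx, k=10):
    """Assemble one (anchor, positive, negative) triplet.

    `annoy_index` is assumed to be a built annoy.AnnoyIndex.
    `dataset` must support random access by row (dataset[i]),
    which is exactly why a forward-only generator is not enough.
    """
    # k nearest neighbors of the anchor; the first hit is the
    # anchor itself, so skip it.
    neighbors = annoy_index.get_nns_by_item(anchor_idx, k + 1)[1:]
    positive_idx = random.choice(neighbors)

    # Draw a negative from outside the anchor's neighborhood.
    negative_idx = random.randrange(len(dataset))
    while negative_idx == anchor_idx or negative_idx in neighbors:
        negative_idx = random.randrange(len(dataset))

    # These three rows can live anywhere in the data, far outside
    # the current batch of anchors.
    return dataset[anchor_idx], dataset[positive_idx], dataset[negative_idx]
```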