Custom generator for training on out-of-memory datasets

In https://bering-ivis.readthedocs.io/en/latest/oom_datasets.html, for out-of-memory datasets, you say to train on h5 files that exist on disk.

In my case, I can’t use h5 files, but I could use a custom generator which yields numpy array batched data.

Is there a way to provide batched data through a custom generator function? Something like keras’ fit_generator.

Thank you

Issue Analytics

State:
Created 4 years ago
Comments: 5 (1 by maintainers)

Top GitHub Comments

Szubie commented, Jan 12, 2021

Hi, we have recently introduced initial support for training on arbitrary out-of-memory datasets using ivis by formalizing the interface that input data must conform to.

Ivis will accept ivis.data.sequence.IndexableDataset instances in its fit, transform and fit_transform methods. An IndexableDataset inherits from collections.abc.Sequence and defines one new method, shape, that takes no arguments and returns the expected shape of the dataset (for example, [rows, columns]).

The collections.abc.Sequence class requires __len__ (returns number of rows) and __getitem__ (returns data at row index) to be implemented. When implementing the __getitem__ method we can customize how the data is retrieved to behave in any way desired.
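To make the interface concrete, here is a minimal sketch of a row-indexable dataset that loads data lazily, one chunk at a time. It is an illustration under stated assumptions, not code from ivis: the class name ChunkedDataset, the load_chunk callable and the fake_load_chunk helper are all hypothetical, and the class inherits from collections.abc.Sequence so that it runs without ivis installed. For real use it would presumably subclass ivis.data.sequence.IndexableDataset instead.

```python
from collections.abc import Sequence


class ChunkedDataset(Sequence):
    """Row-indexable view over data that is stored in fixed-size chunks.

    Hypothetical sketch: for real use with ivis, this would subclass
    ivis.data.sequence.IndexableDataset rather than Sequence.
    """

    def __init__(self, load_chunk, n_rows, n_cols, chunk_size):
        # load_chunk is an assumed callable: chunk index -> list of rows
        self._load_chunk = load_chunk
        self._n_rows = n_rows
        self._n_cols = n_cols
        self._chunk_size = chunk_size
        self._cached_id = None
        self._cached_chunk = None

    def __len__(self):
        # Number of rows in the whole dataset
        return self._n_rows

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return [self[i] for i in range(*idx.indices(self._n_rows))]
        if idx < 0:
            idx += self._n_rows
        if not 0 <= idx < self._n_rows:
            raise IndexError(idx)
        # Locate the chunk holding this row and the row's offset inside it
        chunk_id, offset = divmod(idx, self._chunk_size)
        if chunk_id != self._cached_id:
            # Only one chunk is held in memory at a time
            self._cached_chunk = self._load_chunk(chunk_id)
            self._cached_id = chunk_id
        return self._cached_chunk[offset]

    def shape(self):
        # Takes no arguments and returns [rows, columns], as described above
        return [self._n_rows, self._n_cols]


def fake_load_chunk(chunk_id, chunk_size=4, n_cols=3):
    """Stand-in for reading a chunk from disk, a database, or a remote store."""
    start = chunk_id * chunk_size
    return [[float(r)] * n_cols for r in range(start, start + chunk_size)]


dataset = ChunkedDataset(fake_load_chunk, n_rows=10, n_cols=3, chunk_size=4)
```

Caching only the most recent chunk keeps memory use low, but it helps only when nearby rows are requested together. Random access across the whole dataset would reload chunks often, so a larger cache may be worth considering.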

As a reference example, we have provided an ivis.data.sequence.ImageDataset class, which reads image files from disk into memory when indexed.

This is still quite a new feature and we may enhance it based on the feedback we get, so any thoughts on your experience with this would be valued if you end up trying it. We also want to, in time, expand the classes we provide to cover some common use-cases. This is particularly relevant since tensorflow==2.4.0 has deprecated support for the HDF5Matrix class, making that method of out-of-memory training impossible on newer versions of TensorFlow.

Szubie commented, Feb 12, 2020

> I don’t know if this is a custom training strategy, but if you use Keras’ fit() method, my understanding is that “batch shuffle” doesn’t shuffle rows inside batches; it shuffles the order of the batches (please correct me if I’m wrong).

That’s right.

Each triplet is made up of three data points: 1) the anchor, 2) the positive example (one of the k-nearest neighbors), and 3) a negative example. The keras fit method only shuffles the anchors - when using the ‘batch’ shuffle mode, anchors are shuffled within a batch.

But each anchor data point then needs to be combined with a positive and negative example in order to create a triplet. And these points may be in a completely different part of the data, outside of the current batch of ‘anchors’.

For each anchor, we can retrieve the index of a positive example using the AnnoyIndex, but to actually retrieve the data at that index we need an indexable data structure (at least at the moment).
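The need for random access can be illustrated with a small sketch of triplet construction. This is not the ivis implementation: build_triplets is a hypothetical helper, the neighbours dictionary stands in for a lookup against the AnnoyIndex, and drawing the negative uniformly from rows that are neither the anchor nor one of its neighbours is an assumed strategy chosen for illustration.

```python
import random


def build_triplets(data, anchor_indices, neighbours, rng):
    """Return (anchor, positive, negative) rows for each anchor index.

    `neighbours` maps a row index to the indices of its nearest neighbours;
    here it is a plain dict standing in for a nearest-neighbour index lookup.
    """
    n_rows = len(data)
    triplets = []
    for a in anchor_indices:
        # Positive: one of the anchor's nearest neighbours
        p = rng.choice(neighbours[a])
        # Negative (assumed strategy): any row that is neither the anchor
        # nor one of its neighbours
        excluded = set(neighbours[a]) | {a}
        n = rng.randrange(n_rows)
        while n in excluded:
            n = rng.randrange(n_rows)
        # data[p] and data[n] may lie far outside the current anchor batch,
        # which is why `data` must support lookup by arbitrary index
        triplets.append((data[a], data[p], data[n]))
    return triplets


# Toy data: 8 one-column rows, each row's neighbours are the adjacent rows
data = [[float(i)] for i in range(8)]
neighbours = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
batch = [0, 1]  # indices of the anchors in the current batch
triplets = build_triplets(data, batch, neighbours, random.Random(0))
```

A generator that yields batches in a fixed order can supply the anchors, but it cannot serve the data[p] and data[n] lookups, which is the limitation described above.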