Interface for datasets that are too large to use `InMemoryDataset`
🚀 The feature, motivation and pitch
There are several examples of datasets for molecular property prediction where each individual graph easily fits in memory, but there are too many examples to fit within the `InMemoryDataset` interface. One solution is to save each example in its own `.pt` file, but this introduces a significant filesystem overhead to access each example.

A better solution is to partition the data such that many graphs are serialised within a single `.pt` file. The number of graphs per file can be considered a `chunk_size` parameter which is independent of the training `batch_size`. This `ChunkedDataset` interface would be expected to scale to as large a dataset as desired while avoiding the significant overhead of having one graph per file.
The design idea is roughly the following (a minimal sketch is given after the list):
- `ChunkedDataset` inherits from the PyG `Dataset` interface
- Accepts a `chunk_size` argument
- Has an abstract method `process_chunk` that accepts a list of data objects that can be processed and saved as a single `.pt` file
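A minimal sketch of what that write path could look like, assuming PyG's `Dataset` base class. The `generate_examples` iterator and the chunk file naming are placeholders rather than part of the proposal, and the usual `Dataset` hooks (`raw_file_names`, `processed_file_names`, `len`, `get`) are omitted for brevity:

```python
from typing import List

from torch_geometric.data import Data, Dataset


class ChunkedDataset(Dataset):
    """Graphs are serialised `chunk_size` at a time, one chunk per .pt file."""

    def __init__(self, root, chunk_size=1024, transform=None, pre_transform=None):
        self.chunk_size = chunk_size
        super().__init__(root, transform, pre_transform)

    def generate_examples(self):
        # Placeholder: yield Data objects from the raw source one at a time.
        raise NotImplementedError

    def process_chunk(self, data_list: List[Data], chunk_idx: int):
        # Abstract method from the proposal: process up to `chunk_size` Data
        # objects and save them as a single .pt file, e.g. with torch.save().
        raise NotImplementedError

    def process(self):
        buffer, chunk_idx = [], 0
        for data in self.generate_examples():
            buffer.append(data)
            if len(buffer) == self.chunk_size:
                self.process_chunk(buffer, chunk_idx)
                buffer, chunk_idx = [], chunk_idx + 1
        if buffer:  # flush the final, possibly smaller, chunk
            self.process_chunk(buffer, chunk_idx)
```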
Other considerations:
- The training batch size should not depend on the `chunk_size`, so `ChunkedDataset` should support splitting reads across parallel workers as well as random shuffling (see the usage sketch after this list)
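For illustration, a hypothetical usage sketch: because `ChunkedDataset` would still expose the standard per-example `Dataset` indexing API, the existing `DataLoader` machinery could provide shuffling and parallel workers, with `batch_size` chosen independently of `chunk_size`. `MyChunkedDataset` is a made-up subclass, not something proposed in the issue:

```python
from torch_geometric.loader import DataLoader

# Hypothetical subclass filling in process_chunk(), len() and get().
dataset = MyChunkedDataset(root='data/molecules', chunk_size=10_000)

# The training batch size is independent of chunk_size; shuffling and parallel
# reads come from the standard DataLoader, not from the chunking itself.
loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4)

for batch in loader:
    ...  # training step on a mini-batch of graphs
```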
Alternatives
No response
Additional context
No response
Top GitHub Comments
Hi, I’d like to build this ChunkedDataset support if no one else is doing it. 😃 My own project also needs it.

So basically my current thought is that users write the chunking logic in `process()` and store the results in data structures such as `self.chunked_data` and `self.chunked_slices` (both lists for now, compared with the single `data`/`slices` pair in `InMemoryDataset`), so that `ChunkedDataset` can load them in the `len()` and `get()` methods (either on demand or with prefetching). I have implemented a hard-coded version of the chunked dataset (it only applies to my dataset) and it works. I will work on retrofitting my code into a generic version.
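To make that concrete, here is a hypothetical continuation of the earlier sketch. It simplifies the `self.chunked_data`/`self.chunked_slices` pairing described above by saving a plain list of `Data` objects per chunk, and `self.num_examples` plus the `chunk_{i}.pt` naming are assumptions rather than decisions from this issue:

```python
import os.path as osp

import torch


class MyChunkedDataset(ChunkedDataset):  # subclass of the sketch given earlier
    def process_chunk(self, data_list, chunk_idx):
        # Simplification of chunked_data/chunked_slices: each chunk is saved
        # as a plain list of Data objects in a single .pt file.
        torch.save(data_list, osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt'))

    def len(self):
        # Assumes the total example count was recorded during process(),
        # e.g. in a small metadata file, and loaded into self.num_examples.
        return self.num_examples

    def get(self, idx):
        # On-demand loading: map the global index to (chunk, offset) and keep
        # the most recently used chunk in memory for cheap sequential access.
        chunk_idx, offset = divmod(idx, self.chunk_size)
        if getattr(self, '_cached_chunk_idx', None) != chunk_idx:
            path = osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt')
            self._cached_chunk = torch.load(path)  # a list of Data objects
            self._cached_chunk_idx = chunk_idx
        return self._cached_chunk[offset]
```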