Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Interface for datasets that are too large to use `InMemoryDataset`

See original GitHub issue

🚀 The feature, motivation and pitch

There are several datasets for molecular property prediction where each individual graph easily fits in memory, but there are too many examples for the whole dataset to fit within the InMemoryDataset interface. One solution is to save each example in its own .pt file, but this introduces significant filesystem overhead for accessing each example.

A better solution is to partition the data so that many graphs are serialised within a single .pt file. The number of graphs per file can be treated as a chunk_size parameter that is independent of the training batch_size. This ChunkedDataset interface would be expected to scale to arbitrarily large datasets while avoiding the significant overhead of having one graph per file.

The design idea is roughly:

  • ChunkedDataset inherits from the PyG Dataset interface
  • Accepts a chunk_size argument
  • Has an abstract method process_chunk that takes a list of data objects, processes them, and saves them as a single .pt file.
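A minimal sketch of what that interface could look like, assuming PyG's `Dataset` base class; the `raw_data_iter` helper and the `chunk_{i}.pt` file layout are illustrative placeholders from this proposal, not an existing PyG API:

```python
import os.path as osp
from abc import abstractmethod

import torch
from torch_geometric.data import Dataset


class ChunkedDataset(Dataset):
    """Sketch: serialises `chunk_size` graphs per processed .pt file."""

    def __init__(self, root, chunk_size=1000, transform=None, pre_transform=None):
        self.chunk_size = chunk_size
        super().__init__(root, transform, pre_transform)

    @abstractmethod
    def process_chunk(self, data_list):
        """Turn a list of `Data` objects into one serialisable object."""
        raise NotImplementedError

    def process(self):
        chunk, chunk_idx = [], 0
        # `raw_data_iter()` is a stand-in for however a subclass reads raw examples.
        for data in self.raw_data_iter():
            chunk.append(data)
            if len(chunk) == self.chunk_size:
                out = self.process_chunk(chunk)
                torch.save(out, osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt'))
                chunk, chunk_idx = [], chunk_idx + 1
        if chunk:  # flush the final, possibly smaller, chunk
            out = self.process_chunk(chunk)
            torch.save(out, osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt'))

    # len() and get() would map a global example index onto a chunk file;
    # omitted here for brevity.
```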

Other considerations:

  • The training batch size should not depend on the chunk_size, so the dataset must be able to serve individual examples regardless of which chunk they live in
  • ChunkedDataset should support splitting across parallel workers as well as random shuffling
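For example, a hypothetical subclass could be consumed like any other PyG dataset, with the loader's batch_size chosen independently of the on-disk chunk_size (the class name and arguments below are assumptions for illustration):

```python
from torch_geometric.loader import DataLoader

# Hypothetical subclass implementing process_chunk() from the sketch above.
dataset = MyChunkedDataset(root='data/molecules', chunk_size=10_000)

# batch_size=32 is unrelated to chunk_size=10_000: shuffling and the
# num_workers split operate on example indices, not on chunk files.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
for batch in loader:
    pass  # train step
```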

Alternatives

No response

Additional context

No response

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 5
  • Comments: 21 (14 by maintainers)

Top GitHub Comments

6 reactions
LiuHaolan commented, Jun 2, 2022

Hi, I’d like to build this ChunkedDataset support if no one else is working on it. 😃 My own project also needs it.

4 reactions
LiuHaolan commented, Jun 16, 2022

My current thinking is that users write the chunking logic in process() and store the results in data structures such as self.chunked_data and self.chunked_slices (both lists, compared with the single data/slices pair in InMemoryDataset), so that ChunkedDataset can load them in its len() and get() methods (either on demand or with prefetching).

I have implemented a hard-coded version of ChunkedDataset (it only applies to my own dataset) and it works. I will retrofit my code into a generic version.

I agree; we probably need to require a process_example method and handle the chunk-creation logic internally. WDYT?
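One way the len()/get() mapping described above could work is a cumulative-offset lookup from a global example index to a (chunk file, local index) pair, caching the most recently loaded chunk. This sketch assumes each chunk file stores a plain list of `Data` objects; the class and attribute names are illustrative, not part of any existing API:

```python
import bisect
import itertools
import os.path as osp

import torch


class ChunkedLookup:
    """Resolve a global example index to a chunk file and a local index."""

    def __init__(self, processed_dir, chunk_sizes):
        self.processed_dir = processed_dir
        # Cumulative offsets: chunk sizes [1000, 1000, 350] -> [1000, 2000, 2350].
        self.offsets = list(itertools.accumulate(chunk_sizes))
        self._cached_idx, self._cached_chunk = None, None

    def __len__(self):
        return self.offsets[-1]

    def get(self, idx):
        chunk_idx = bisect.bisect_right(self.offsets, idx)
        start = self.offsets[chunk_idx - 1] if chunk_idx > 0 else 0
        if chunk_idx != self._cached_idx:  # load on demand; prefetching also possible
            path = osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt')
            self._cached_chunk = torch.load(path)
            self._cached_idx = chunk_idx
        return self._cached_chunk[idx - start]
```

Note that fully random shuffling defeats a single-chunk cache, so a practical implementation might instead shuffle the chunk order and then shuffle within each loaded chunk.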


