
Handle datasets with many small graphs

See original GitHub issue

Hi, I’ve created my first dataset based on InMemoryDataset. It works pretty well, and working with it is really smooth.

More recently, I’ve added many more examples and features to my dataset, and it doesn’t fit in RAM anymore. I’ve created another class based on Dataset, but I now have more than 500K tiny files (~15 KB each) in my dataset folder. Now, when I try to load them, it’s much slower (even with 8 dataloader workers); I guess my HDD can’t handle this I/O rate.

How do you handle this kind of dataset? Is there a simple/clean solution to save, let’s say, 100 or 1000 graphs inside a single file in order to limit the total number of tiny files?
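
For context, my current setup stores one file per graph; the Dataset-based class boils down to something like this (simplified sketch, illustrative names, process()/download() omitted):

import os.path as osp

import torch
from torch_geometric.data import Dataset


class ManySmallGraphs(Dataset):  # illustrative name
    # self.num_graphs is set in __init__ (omitted here)

    @property
    def processed_file_names(self):
        # one tiny .pt file per graph -> 500K+ files in processed_dir
        return ['data_{}.pt'.format(i) for i in range(self.num_graphs)]

    def __len__(self):
        return self.num_graphs

    def get(self, idx):
        # every sample is a separate small read from the HDD
        return torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))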

Thanks,

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
RobinFrcd commented, Feb 15, 2019

Alright, thanks for your help! I’m trying to rewrite this to save files of size=batch_size on my disk. I’ll then use a data loader with batch_size=1 so I can still use PyTorch data loader features like queues and multiple workers (don’t know if multiple workers are useful with batch_size=1, though). Hope this will help!
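
Roughly what I have in mind for the loading side (untested; chunked_dataset is a placeholder for whatever Dataset ends up serving one pre-batched file per index):

from torch.utils.data import DataLoader


def unwrap(items):
    # batch_size=1, so `items` holds exactly one pre-collated Batch
    return items[0]


# Each stored file already contains `batch_size` graphs collated together,
# so the outer loader prefetches whole chunks instead of tiny files.
loader = DataLoader(chunked_dataset, batch_size=1, shuffle=True,
                    num_workers=8, collate_fn=unwrap)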

I’ll keep you in touch!

1 reaction
rusty1s commented, Feb 15, 2019

Interesting question! In fact, PyTorch datasets are quite limited for this special case 😦 I think your proposed solution is the way to go, but we currently do not have a clean solution. However, I hope this snippet helps you (untested/pseudo code):

import torch
from torch_geometric.data import Batch

def process(self):
    data_list = []
    for i in range(num_graphs):
        ...
        data_list.append(data)
        if (i + 1) % 100 == 0:  # flush every 100 graphs (left-over graphs at the end not handled here)
            batch = Batch.from_data_list(data_list)
            batch.mini_batch = batch.batch  # Keep the internal mini-batch assignment under a new name.
            batch.batch = None  # Delete the batch vector so we can still use external batching.
            torch.save(batch, 'path_{}'.format(i // 100))
            data_list = []

def get(self, idx):
    return torch.load('path_{}'.format(idx))
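
On the loading side, you could then restore the internal batch vector before feeding the model, e.g. (again untested; loader and model are placeholders):

for chunk in loader:
    chunk.batch = chunk.mini_batch  # restore the per-node graph assignment saved in process()
    out = model(chunk.x, chunk.edge_index, chunk.batch)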

Let me know if this speeds things up.

