Handle datasets with many small graphs
Hi,
I've created my first dataset based on `InMemoryDataset`. It works pretty well and working with it is really smooth.
More recently, I've added many more examples and features to my dataset, and it doesn't fit in RAM anymore. I've created another class based on `Dataset`, but I now have more than 500K tiny files (~15 KB each) in my dataset folder. Now, when I try to load them it's much slower (even with 8 dataloader workers); I guess my HDD can't handle this I/O rate.
How do you handle this kind of dataset? Is there a simple/clean solution to save, let's say, 100 or 1000 graphs inside a single file in order to limit the total number of tiny files?
Thanks,
Issue Analytics
- Created 5 years ago
- Comments: 6 (3 by maintainers)
Top Results From Across the Web

Handle datasets with many small graphs · Issue #93 - GitHub
Hi, I've created my first dataset based on InMemoryDataset it works pretty well and working with it is really smooth.

Dealing with very small datasets | Kaggle
In this kernel we will see some techniques to handle very small datasets, where the main challenge is to avoid overfitting. Why small...

How To Use Deep Learning Even with Small Data
The basic idea with fine-tuning is to take a very large data set which is hopefully somewhat similar to your domain, train a...

Advanced Mini-Batching — pytorch_geometric documentation
PyG automatically takes care of batching multiple graphs into a single giant graph with the help of the torch_geometric.

Data modes - Spektral
Also, it's not always the case that we have many graphs in our datasets. Sometimes, we're just interested in classifying the nodes of...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Alright, thanks for your help! I'm trying to rewrite this to save files of size `batch_size` on my disk. I'll then use a data loader with `batch_size=1` to be able to use the PyTorch data loader features like queues and multiple workers (I don't know if multiple workers are useful with `batch_size=1`, though). Hope this will help! I'll keep you posted!
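A minimal sketch of what this pre-batching idea could look like (an illustration, not code from the thread; it assumes PyTorch Geometric's `Batch.from_data_list`, and the directory layout and helper names are made up):

```python
# Sketch of the pre-batching approach: graphs are collated into Batch objects
# of `batch_size` graphs each and saved one Batch per file, so a DataLoader
# with batch_size=1 streams full mini-batches from disk.

import os

import torch
from torch.utils.data import Dataset, DataLoader
from torch_geometric.data import Batch


def save_prebatched(data_list, out_dir, batch_size=128):
    """Group a list of Data objects into Batch files of `batch_size` graphs."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(data_list), batch_size):
        batch = Batch.from_data_list(data_list[i:i + batch_size])
        torch.save(batch, os.path.join(out_dir, f'batch_{i // batch_size:06d}.pt'))


class PreBatchedDataset(Dataset):
    """Each item is an already-collated Batch loaded from a single file."""

    def __init__(self, out_dir):
        self.paths = sorted(
            os.path.join(out_dir, name)
            for name in os.listdir(out_dir) if name.endswith('.pt'))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])


# batch_size=1 because each file already holds a full mini-batch; the
# collate_fn unwraps the singleton list produced by the loader.
loader = DataLoader(PreBatchedDataset('prebatched/'), batch_size=1,
                    shuffle=True, num_workers=4,
                    collate_fn=lambda samples: samples[0])
```

One trade-off of this layout is that the composition of each mini-batch is frozen at preprocessing time, so shuffling can only reorder whole batches.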
Interesting question! In fact, PyTorch datasets are quite limited for this special case 😦 I think your proposed solution is the way to go, but we currently do not have a clean solution. However, I hope this snippet helps you (untested/pseudo code):
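As an illustration of that chunking idea (a reconstruction, not the original snippet; the chunk size, file names, and the plain `torch.utils.data.Dataset` base class are all assumptions), one could store roughly 1000 graphs per file and map a global graph index to a chunk and an offset:

```python
# Sketch of chunked storage: each .pt file holds a list of ~1000 Data objects,
# and a global graph index is mapped to (chunk file, offset) with a small
# cache of recently loaded chunks.

import os
from functools import lru_cache

import torch
from torch.utils.data import Dataset, DataLoader
from torch_geometric.data import Batch

CHUNK_SIZE = 1000  # graphs per file; tune to your disk and RAM


def save_chunks(data_list, out_dir):
    """Write a list of Data objects as files of CHUNK_SIZE graphs each."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(data_list), CHUNK_SIZE):
        torch.save(data_list[i:i + CHUNK_SIZE],
                   os.path.join(out_dir, f'chunk_{i // CHUNK_SIZE:05d}.pt'))


class ChunkedGraphDataset(Dataset):
    """Serves individual graphs that are stored CHUNK_SIZE to a file."""

    def __init__(self, chunk_dir, num_graphs):
        self.chunk_dir = chunk_dir
        self.num_graphs = num_graphs

    @lru_cache(maxsize=2)  # keep the most recently used chunks in memory
    def _load_chunk(self, chunk_idx):
        path = os.path.join(self.chunk_dir, f'chunk_{chunk_idx:05d}.pt')
        return torch.load(path)

    def __len__(self):
        return self.num_graphs

    def __getitem__(self, idx):
        chunk = self._load_chunk(idx // CHUNK_SIZE)
        return chunk[idx % CHUNK_SIZE]


# Individual Data objects are re-collated into a PyG Batch at load time.
dataset = ChunkedGraphDataset('chunks/', num_graphs=500_000)
loader = DataLoader(dataset, batch_size=128, num_workers=4,
                    collate_fn=Batch.from_data_list)
```

Note that fully random shuffling will defeat the small chunk cache; shuffling the chunk order and then shuffling within each chunk is a common compromise.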
Let me know if this speeds things up.