Handle datasets with many small graphs

Hi, I’ve created my first dataset based on InMemoryDataset. It works pretty well, and working with it is really smooth.

More recently, I’ve added many more examples and features to my dataset, and it no longer fits in RAM. I’ve created another class based on Dataset, but I now have more than 500K tiny files (~15 KB each) in my dataset folder. Loading them is now much slower (even with 8 dataloader workers); I guess my HDD can’t handle this I/O rate.

How do you handle this kind of dataset? Is there a simple/clean solution to save, let’s say, 100 or 1000 graphs inside a single file in order to limit the total number of tiny files?

Thanks,

Issue Analytics

Created 5 years ago
Comments:6 (3 by maintainers)

Top GitHub Comments

RobinFrcd commented, Feb 15, 2019

Alright, thanks for your help! I’m trying to rewrite this to save files of size=batch_size on my disk. I’ll then use a data loader with batch_size=1 so I can still use PyTorch data loader features like queues and multiple workers (I don’t know if multiple workers are useful with batch_size=1, though). Hope this will help!

I’ll keep you posted!
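
One consequence of the plan above is that each file becomes the unit of shuffling: the loader can reorder files between epochs, but the graphs inside a file always travel together. Here is a minimal pure-Python sketch of that behaviour. The names iter_epoch, load_file and fake_files are hypothetical, and the dictionary lookup stands in for reading a saved file from disk; this is not the PyTorch DataLoader itself.

```python
import random


def iter_epoch(num_files, load_file, seed=None):
    """Yield one pre-built batch per file, visiting the files in random order."""
    order = list(range(num_files))
    random.Random(seed).shuffle(order)  # Shuffle file indices, not single graphs.
    for file_idx in order:
        yield load_file(file_idx)


# Stand-in for files on disk: 5 "files", each holding 4 consecutive graph ids.
fake_files = {i: list(range(i * 4, (i + 1) * 4)) for i in range(5)}

batches = list(iter_epoch(len(fake_files), fake_files.__getitem__, seed=0))
```

If the same graphs ending up in the same batch every epoch is a concern, one option is to make the files smaller than the training batch and combine several of them per step.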

rusty1s commented, Feb 15, 2019

Interesting question! In fact, PyTorch datasets are quite limited for this special case 😦 I think your proposed solution is the way to go, but we currently do not have a clean solution. However, I hope this snippet helps you (untested/pseudo code):

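import torch
from torch_geometric.data import Batch

def process(self):
    data_list = []
    for i in range(num_graphs):
        ...
        data_list.append(data)
        if (i + 1) % 100 == 0:  # Flush after every 100 graphs.
            batch = Batch.from_data_list(data_list)
            batch.mini_batch = batch.batch  # Keep the graph assignment as an internal mini batch.
            batch.batch = None  # Delete batch vector so we can still use external batching.
            torch.save(batch, 'path_{}'.format(i // 100))
            data_list = []
    # Note: if num_graphs is not a multiple of 100, the graphs still in data_list must be saved too.

def get(self, idx):
    # idx is the index of a saved file (100 graphs), not of a single graph.
    return torch.load('path_{}'.format(idx))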

Let me know if this speeds things up.
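
If single graphs still need to be addressable after they are packed into files, the global index can be split into a file index and an offset inside that file. The sketch below is only an illustration of that bookkeeping, under a few assumptions: it uses pickle instead of torch.save, plain dicts instead of graph objects, and the names write_chunks, ChunkedStore and CHUNK_SIZE are made up for this example.

```python
import os
import pickle
import tempfile

CHUNK_SIZE = 100  # Graphs per file (assumed; the thread suggests 100 or 1000).


def _dump(chunk, root, chunk_idx):
    with open(os.path.join(root, f"chunk_{chunk_idx}.pkl"), "wb") as f:
        pickle.dump(chunk, f)


def write_chunks(items, root, chunk_size=CHUNK_SIZE):
    """Write items as chunk_0.pkl, chunk_1.pkl, ... and return the file count."""
    num_chunks = 0
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            _dump(chunk, root, num_chunks)
            num_chunks += 1
            chunk = []
    if chunk:  # Do not drop the final, partially filled chunk.
        _dump(chunk, root, num_chunks)
        num_chunks += 1
    return num_chunks


class ChunkedStore:
    """Index single items while keeping only the most recent chunk in memory."""

    def __init__(self, root, num_items, chunk_size=CHUNK_SIZE):
        self.root = root
        self.num_items = num_items
        self.chunk_size = chunk_size
        self._cached_idx = None
        self._cached_chunk = None

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_items:
            raise IndexError(idx)
        # Map the global index to (file index, position inside the file).
        chunk_idx, offset = divmod(idx, self.chunk_size)
        if chunk_idx != self._cached_idx:
            path = os.path.join(self.root, f"chunk_{chunk_idx}.pkl")
            with open(path, "rb") as f:
                self._cached_chunk = pickle.load(f)
            self._cached_idx = chunk_idx
        return self._cached_chunk[offset]


# Usage with 250 toy "graphs" stored as plain dicts.
root = tempfile.mkdtemp()
graphs = [{"id": i, "num_nodes": i % 7 + 1} for i in range(250)]
num_files = write_chunks(graphs, root)
store = ChunkedStore(root, num_items=len(graphs))
```

The one-chunk cache only pays off when consecutive accesses hit the same file, so fully random per-graph access would still reread files often; sequential or per-file shuffled access fits this layout better.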