Benchmarked dataset iteration speed lower than expected
Hello!
I’m really excited about the features of deeplake (streaming directly from S3, dataset versioning and filtering). However, a preliminary benchmark showed a significantly lower dataset iteration speed compared to local file storage when iterating over (256,256,3) uint8 PNGs:
- local dataset using tf.io loader: ~70-1000 batches/s
- local dataset using PIL loader: ~25 batches/s
- local dataset using deeplake dataset: ~5 batches/s
Sidenote: When iterating over the deeplake dataset, download speed was ~50MB/s. When downloading a single 3GB file from S3, average download speed was ~300MB/s.
The small benchmark code is stored here: https://github.com/cgebbe/benchmark_deeplake/tree/50621dd28a08208fe70deb07d451d01474687b54
Are these numbers to be expected? Am I using the library wrong, or are higher speeds only available via activeloop and not S3? I hoped that the iteration speed would be at least as fast as the local PIL loader.
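For readers who want to reproduce batches/s figures like the ones above, here is a minimal timing sketch. It is not the code from the linked repository; the function name `batches_per_second` and the `max_batches` and `warmup` parameters are made up for illustration. It only assumes that the loader under test is a plain Python iterable of batches.

```python
import time
from itertools import islice


def batches_per_second(batches, max_batches=200, warmup=5):
    """Iterate over `batches` and return the measured throughput in batches/s.

    `batches` can be any iterable of batches (a tf.data dataset, a PyTorch
    DataLoader, a generator, ...). The first `warmup` batches are consumed
    before the clock starts, so one-off setup cost (worker startup, opening
    connections) does not distort the result.
    """
    iterator = iter(batches)

    # Consume the warmup batches without timing them.
    for _ in islice(iterator, warmup):
        pass

    start = time.perf_counter()
    # Count at most `max_batches` batches while the clock is running.
    count = sum(1 for _ in islice(iterator, max_batches))
    elapsed = time.perf_counter() - start

    # Guard against a zero-length interval on very fast iterables.
    return count / elapsed if elapsed > 0 else float("inf")
```

Measuring every loader with the same function, batch size and number of batches keeps the comparison fair. Whether to include a warmup phase is a judgment call: it hides startup cost, which may or may not matter for a given workload.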
Hey @cgebbe! Thanks for raising this issue. I see from the benchmarks that you used the tensorflow integration of deeplake. This is a very thin wrapper and is not optimized right now. We have two other dataloaders, available via ds.pytorch() and ds.dataloader() (the latter is an enterprise feature right now, built in C++). Both of these should give significantly better performance. Could you try using those and let us know if the issue persists?
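As a rough sketch of what switching loaders could look like: the helper below takes a dataset object (for example one returned by `deeplake.load(...)`) and builds either of the two loaders mentioned above. The helper name `make_loader` and its `enterprise` flag are hypothetical. The keyword arguments and the chained `.batch(...)` call are assumptions based on the deeplake 3.x documentation, so please check them against the version you have installed.

```python
def make_loader(ds, batch_size=32, num_workers=4, enterprise=False):
    """Build a PyTorch-style loader from a deeplake dataset object `ds`.

    Assumption: `ds` exposes `pytorch()` and `dataloader()` as in deeplake 3.x.
    """
    if enterprise:
        # Assumed builder-style API of the C++ dataloader (enterprise feature):
        # batching is configured on the builder, then converted with .pytorch().
        return ds.dataloader().batch(batch_size).pytorch(num_workers=num_workers)

    # Pure-Python PyTorch integration; shuffling is disabled so that
    # benchmark runs iterate the data in the same order.
    return ds.pytorch(batch_size=batch_size, num_workers=num_workers, shuffle=False)
```

Usage would be along the lines of `ds = deeplake.load(path)` followed by `for batch in make_loader(ds): ...`, where `path` is your own dataset location.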
I’ll follow up the discussion here so that others can see it, too.
python3 -m pip uninstall libdeeplake; python3 -m pip install libdeeplake==0.0.32
fixed the segmentation fault issue, thanks a lot!
As promised, the optimized dataloader is slightly faster than the tensorflow dataloader using PIL:
@AbhinavTuli: I believe you mentioned in the discussion that you still achieve significantly higher download speeds. Is this correct?
Next steps for us are to…
Current code: https://github.com/cgebbe/benchmark_deeplake/blob/8543d1eabdb0e6c0bebd7a4700e7f5c88555c04f/README.md