
Benchmarked dataset iteration speed lower than expected


Hello!

I’m really excited about deeplake’s features (streaming directly from S3, dataset versioning, and filtering). However, a preliminary benchmark showed significantly lower dataset iteration speed with deeplake than with local file storage when iterating over (256,256,3) uint8 PNGs (a rough timing sketch follows the list):

  • local dataset using tf.io loader: ~70-1000 batches/s
  • local dataset using PIL loader: ~25 batches/s
  • local dataset using deeplake dataset: ~5 batches/s
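
For reference, the measurement boils down to a timing loop like the sketch below. This is a minimal illustration only, assuming deeplake’s tensorflow() integration, a hypothetical S3 path, and a batch size of 32; the full benchmark code is linked further down.

    import time
    import deeplake
    import tensorflow as tf

    DATASET_PATH = "s3://my-bucket/benchmark-dataset"  # hypothetical path
    BATCH_SIZE = 32

    ds = deeplake.load(DATASET_PATH, read_only=True)

    # ds.tensorflow() exposes the deeplake dataset as a tf.data.Dataset.
    tf_ds = ds.tensorflow().batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

    start = time.perf_counter()
    num_batches = sum(1 for _ in tf_ds)
    elapsed = time.perf_counter() - start
    print(f"{num_batches / elapsed:.1f} batches/s at batch size {BATCH_SIZE}")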

Sidenote: When iterating over the deeplake dataset, download speed was ~50MB/s. When downloading a single 3GB file from S3, average download speed was ~300MB/s.
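
(Not part of the original benchmark code, but for context: the single-file comparison can be reproduced with a sketch like the one below, using boto3 and a hypothetical bucket and key.)

    import time
    import boto3

    BUCKET, KEY, LOCAL_PATH = "my-bucket", "large-file.bin", "/tmp/large-file.bin"  # hypothetical

    s3 = boto3.client("s3")
    size_bytes = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

    start = time.perf_counter()
    s3.download_file(BUCKET, KEY, LOCAL_PATH)
    elapsed = time.perf_counter() - start
    print(f"average download speed: {size_bytes / elapsed / 1e6:.0f} MB/s")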

The small benchmark code is stored here: https://github.com/cgebbe/benchmark_deeplake/tree/50621dd28a08208fe70deb07d451d01474687b54

Are these numbers to be expected? Am I using the library wrong, or are higher speeds only available via Activeloop storage and not S3? I had hoped that the iteration speed would be at least as fast as the local PIL loader.

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 9

Top GitHub Comments

1 reaction
AbhinavTuli commented, Dec 14, 2022

Hey @cgebbe! Thanks for raising this issue. I see from the benchmarks that you used deeplake’s tensorflow integration. That integration is a very thin wrapper and is not optimized right now. We have two other dataloaders, available via ds.pytorch() and ds.dataloader() (the latter is an enterprise feature right now, built in C++); both should give significantly better performance. Could you try those and let us know if the issue persists?
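
Roughly, the two suggested loaders are used as in the sketch below. This is a minimal illustration with a hypothetical dataset path; the exact builder methods of ds.dataloader() may differ between deeplake versions.

    import deeplake

    ds = deeplake.load("s3://my-bucket/benchmark-dataset", read_only=True)  # hypothetical path

    # Pure-Python PyTorch dataloader.
    torch_loader = ds.pytorch(batch_size=32, num_workers=4, shuffle=False)

    # C++-based enterprise dataloader (requires libdeeplake), built with a fluent API.
    fast_loader = ds.dataloader().batch(32).pytorch(num_workers=4)

    for batch in fast_loader:
        pass  # timing / training loop goes here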

0 reactions
cgebbe commented, Dec 15, 2022

I’ll follow up on our discussion here so that others can see it, too.

Running python3 -m pip uninstall libdeeplake; python3 -m pip install libdeeplake==0.0.32 fixed the segmentation fault issue, thanks a lot!

As promised, the optimized dataloader is slightly faster than the tensorflow dataloader that uses PIL:

  • using PIL: ~15-25 batches/s
  • using deeplake’s optimized dataloader with torch on an r6i.xlarge instance: ~20 batches/s (at ~150MB/s)
  • using deeplake’s optimized dataloader with torch on a p3.16xlarge instance: ~30 batches/s (at ~250MB/s)

@AbhinavTuli: I believe you mentioned in our discussion that you still achieve significantly higher download speeds; is this correct?

Next steps for us are to…

  • benchmark the example dataset using local tfrecords files (a rough tf.data sketch follows below)
  • run an actual training on realistic data and monitor GPU utilization. For this, we likely need to wait until the C++ loader supports tensorflow.

Thanks again for the support!
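
For reference, the tfrecords baseline we have in mind is roughly the sketch below. This is a minimal illustration only, assuming one serialized PNG per record under a hypothetical "image" key and a hypothetical file location.

    import time
    import tensorflow as tf

    files = tf.data.Dataset.list_files("/data/benchmark/*.tfrecord")  # hypothetical location

    def parse(example_proto):
        features = {"image": tf.io.FixedLenFeature([], tf.string)}
        parsed = tf.io.parse_single_example(example_proto, features)
        return tf.io.decode_png(parsed["image"], channels=3)

    ds = (
        tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
        .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(32)
        .prefetch(tf.data.AUTOTUNE)
    )

    start = time.perf_counter()
    num_batches = sum(1 for _ in ds)
    print(f"{num_batches / (time.perf_counter() - start):.1f} batches/s")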

Current code: https://github.com/cgebbe/benchmark_deeplake/blob/8543d1eabdb0e6c0bebd7a4700e7f5c88555c04f/README.md
