
Performance comparison of make_reader() & make_petastorm_dataset() vs make_spark_converter() & make_tf_dataset()

See original GitHub issue

From the API user guide, it seems that there are two different ways of using Petastorm to train TensorFlow models:

  1. Using make_reader() or make_batch_reader() and then using make_petastorm_dataset() to create a tf.data object
  2. Using make_spark_converter() to materialize the dataset and then using converter.make_tf_dataset() to create a tf.data object

All things being equal, which of these would be expected to perform faster? I know that option 1 reads from a file path while option 2 starts with a Spark DataFrame. Option 2 seems simpler, but is there a performance penalty associated with it?

Thanks
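
For concreteness, here is a minimal, hypothetical sketch of the two approaches; the local paths, toy DataFrame, and Keras model are placeholders and are not taken from the issue.

```python
# Sketch only: paths, columns, and model are placeholders, not code from the issue.
import tensorflow as tf
from pyspark.sql import SparkSession
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.master('local[2]').getOrCreate()
df = spark.range(1000).selectExpr('cast(id as float) as x',
                                  'cast(id % 2 as float) as y')

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer='adam', loss='mse')

# Option 1: Parquet already on disk -> make_batch_reader + make_petastorm_dataset.
df.write.mode('overwrite').parquet('/tmp/demo_parquet')
with make_batch_reader('file:///tmp/demo_parquet') as reader:
    dataset = make_petastorm_dataset(reader)  # yields one batch per Parquet row group
    dataset = dataset.map(lambda b: (tf.reshape(b.x, [-1, 1]), b.y))
    model.fit(dataset, epochs=1)

# Option 2: Spark DataFrame -> make_spark_converter + make_tf_dataset.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')
converter = make_spark_converter(df)  # materializes df into a temporary Parquet store
with converter.make_tf_dataset(batch_size=32) as dataset:
    dataset = dataset.map(lambda b: (tf.reshape(b.x, [-1, 1]), b.y))
    # make_tf_dataset loops over the data indefinitely by default,
    # so bound the epoch explicitly.
    model.fit(dataset, steps_per_epoch=20, epochs=1)
converter.delete()  # remove the cached Parquet files
```

As the maintainer notes in the comments below, option 2 uses make_batch_reader + make_petastorm_dataset underneath once the DataFrame has been materialized, so the main additional cost is the one-time write of the temporary Parquet cache.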

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
selitvin commented, Feb 3, 2021

make_spark_converter -> make_tf_dataset uses make_batch_reader + make_petastorm_dataset underneath (to read from the temporary Parquet store it creates).

Can you please provide more information on the slowdown?

  • 50x slowdown: is this a 50x slowdown measured per forward/backward propagation iteration?
  • What kind of IO method do you use for your baseline (when you say 50x slowdown, compared to what)? Could there be additional differences between your IO methods?
  • What is the data distribution? How many fields are in each row? How many rows are in a row group? (A pyarrow sketch for checking this follows below.)

It would be best if you could distill a small example that I could actually run and profile. It might be hard to see the issue from the code alone, as it is likely about the combination of the code and the underlying data structure.
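
For the data-distribution questions above, the field count and rows per row group can be read from the Parquet footer metadata with pyarrow; a minimal sketch follows (the dataset directory is a placeholder):

```python
# Inspect Parquet metadata to answer: how many fields per row, how many rows per row group?
import glob
import pyarrow.parquet as pq

for path in glob.glob('/tmp/demo_parquet/*.parquet'):  # placeholder directory
    md = pq.ParquetFile(path).metadata
    print(path, '-',
          md.num_columns, 'columns,',
          md.num_rows, 'rows,',
          md.num_row_groups, 'row groups')
    if md.num_row_groups:
        print('  rows in first row group:', md.row_group(0).num_rows)
```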

0 reactions
selitvin commented, Feb 9, 2021

No problem at all…

Can you please take a look at the Horovod example:

https://github.com/horovod/horovod/blob/master/examples/spark/keras/keras_spark_rossmann_run.py

I know they were polishing their training pipeline performance and have a good batch-based implementation. Perhaps it will give you some clues.

Read more comments on GitHub >
