
Performance comparison of make_reader() & make_petastorm_dataset() vs make_spark_converter() & make_tf_dataset()

See original GitHub issue

From the API user guide, it seems that there are two different ways of using Petastorm to train TensorFlow models:

  1. Using make_reader() or make_batch_reader() and then using make_petastorm_dataset() to create a tf.data object
  2. Using make_spark_converter() to materialize the dataset and then using converter.make_tf_dataset() to create a tf.data object

All things being equal, which of these would be expected to perform faster? I know that option 1 reads from a file path while option 2 starts with a Spark DataFrame. Option 2 seems simpler, but is there a performance penalty associated with it?

Thanks
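
For concreteness, here is a minimal, hypothetical sketch of the two approaches; the local paths, toy DataFrame, and Keras model are placeholders and are not taken from the issue.

```python
# Sketch only: paths, columns, and model are placeholders, not code from the issue.
import tensorflow as tf
from pyspark.sql import SparkSession
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.master('local[2]').getOrCreate()
df = spark.range(1000).selectExpr('cast(id as float) as x',
                                  'cast(id % 2 as float) as y')

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer='adam', loss='mse')

# Option 1: Parquet already on disk -> make_batch_reader + make_petastorm_dataset.
df.write.mode('overwrite').parquet('/tmp/demo_parquet')
with make_batch_reader('file:///tmp/demo_parquet') as reader:
    dataset = make_petastorm_dataset(reader)  # yields one batch per Parquet row group
    dataset = dataset.map(lambda b: (tf.reshape(b.x, [-1, 1]), b.y))
    model.fit(dataset, epochs=1)

# Option 2: Spark DataFrame -> make_spark_converter + make_tf_dataset.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')
converter = make_spark_converter(df)  # materializes df into a temporary Parquet store
with converter.make_tf_dataset(batch_size=32) as dataset:
    dataset = dataset.map(lambda b: (tf.reshape(b.x, [-1, 1]), b.y))
    # make_tf_dataset loops over the data indefinitely by default,
    # so bound the epoch explicitly.
    model.fit(dataset, steps_per_epoch=20, epochs=1)
converter.delete()  # remove the cached Parquet files
```

As the maintainer notes in the comments below, option 2 uses make_batch_reader + make_petastorm_dataset underneath once the DataFrame has been materialized, so the main additional cost is the one-time write of the temporary Parquet cache.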

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
selitvin commented, Feb 3, 2021

make_spark_converter -> make_tf_dataset uses make_batch_reader + make_petastorm_dataset underneath (to read from the temporary Parquet store it creates).

Can you please provide more information on the slowdown?

  • 50x slowdown: is this a 50x slowdown measured per forward/backward propagation iteration?
  • What kind of IO method do you use for your baseline (when you say 50x slowdown, compared to what)? Could there be additional differences between your IO methods?
  • What is the data distribution? How many fields are in each row? How many rows are in a row group? (A pyarrow sketch for checking this follows below.)

It would be best if you could distill a small example that I could actually run and profile. It might be hard to see the issue from the code alone, as it is likely about the combination of the code and the underlying data structure.
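
For the data-distribution questions above, the field count and rows per row group can be read from the Parquet footer metadata with pyarrow; a minimal sketch follows (the dataset directory is a placeholder):

```python
# Inspect Parquet metadata to answer: how many fields per row, how many rows per row group?
import glob
import pyarrow.parquet as pq

for path in glob.glob('/tmp/demo_parquet/*.parquet'):  # placeholder directory
    md = pq.ParquetFile(path).metadata
    print(path, '-',
          md.num_columns, 'columns,',
          md.num_rows, 'rows,',
          md.num_row_groups, 'row groups')
    if md.num_row_groups:
        print('  rows in first row group:', md.row_group(0).num_rows)
```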

0 reactions
selitvin commented, Feb 9, 2021

No problem at all…

Can you please take a look at the Horovod example:

https://github.com/horovod/horovod/blob/master/examples/spark/keras/keras_spark_rossmann_run.py

I know they were polishing their training pipeline performance and have a good batch-based implementation. Perhaps it will give you some clues.

Read more comments on GitHub >
