Regarding performance of petastorm with shuffle

Hi, I have a 24 GB dataset (1.28 million images), stored as 31 Parquet files of roughly 800 MB each. I have a 60-core CPU machine with 120 GB RAM.

I use the code below:

from petastorm import make_batch_reader

def just_read_batch(dataset_url):
    # Read the dataset once, with 16 worker threads and row-drop shuffling
    with make_batch_reader(dataset_url, num_epochs=1, workers_count=16, shuffle_row_drop_partitions=10) as train_reader:
        i = 0
        for schema_view in train_reader:
            # each schema_view is a batch of rows; count rows via the imagename column
            i += len(schema_view.imagename)
        print(i)

  • Parquet with shuffle (shuffle_row_drop_partitions = 5): 202 seconds (about 6k images per second); run again with shuffle: 200 seconds; run again with shuffle: 200 seconds.
  • With shuffle_row_drop_partitions = 10: 361 seconds; run again: 370 seconds.

How can I improve the performance? I need shuffled data on each epoch. Please check my code snippet and suggest whether I need to change the configuration to get better performance.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 7

Top GitHub Comments

1 reaction
selitvin commented, Mar 14, 2019

Shorter version

  • Don’t set shuffle_row_drop_partitions: I would expect you to get 30K images per second, unless the network/disk saturates.
  • If your images are encoded using a Petastorm codec, make_batch_reader would not decode them for you at the moment (a bug: you would get unusable binary blobs). You should use make_reader, which will handle the decoding correctly. Decoding the images, however, is likely to stress your CPUs, and you may see lower throughput.
  • Place a shuffling queue after the reader to shuffle your data (see the sketch after this list).
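
As a rough sketch of those three points (not code from the thread, and assuming the dataset was written with Petastorm's materialize_dataset so that make_reader can decode the image fields), the reading loop could look roughly like this; the buffer and batch sizes are arbitrary placeholders:

import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

def read_shuffled(dataset_url):
    # No shuffle_row_drop_partitions: row groups are read whole, in random order.
    with make_reader(dataset_url, num_epochs=1, workers_count=16) as reader:
        dataset = make_petastorm_dataset(reader)
        # A shuffling buffer after the reader decorrelates samples coming from
        # the same row group; size it to fit the available RAM.
        dataset = dataset.shuffle(buffer_size=10000).batch(256)
        count = 0
        for batch in dataset:
            count += int(tf.shape(batch.imagename)[0])
        print(count)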

Longer version

shuffle_row_drop_partitions is a very wasteful parameter: it decreases correlation between produced samples at the expense of network traffic. In the case of shuffle_row_drop_partitions=10, only 1/10 of the downloaded samples in a row group are used (the same row group is loaded again later and another 1/10 is used, and so on). For example, a row group of, say, 10,000 rows would contribute only about 1,000 rows per read and would have to be downloaded ten times for all of its rows to be seen. I’d suggest using this switch only in extreme cases due to its wastefulness.

Reading from the Apache Parquet format can only be done in atomic units of a row group. shuffle_row_groups=True results in a random order of row groups being consumed, but there is still an inter-row-group correlation issue. Here are some ideas for reducing correlation between samples:

  • Placing a shuffling queue after the reader. Usually we rely on the TensorFlow shuffling queue (tf_tensors takes a shuffling_queue_capacity parameter, or you can use tf.data facilities). I think you can also use RandomShufflingBuffer in petastorm.reader_impl.shuffling_buffer - we did not intend it to be a public class, but I think it would work as a pure-Python shuffling queue implementation.
  • Reducing the row-group size (and hence the number of correlated samples). If you generate your dataset using Petastorm, materialize_dataset takes row_group_size_mb. Reducing its value will decrease the number of rows in a row group (note that with an off-the-shelf Parquet Java library implementation you cannot get fewer than 100 rows; we have a patch I can share with you that would overcome that issue). A sketch of this and the next idea follows this list.
  • Depending on the nature of your data, shuffling the rows in the dataset before writing it out may help (we pre-shuffle some of our datasets).
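
To make the last two ideas concrete, here is a minimal sketch (not from the thread): writing the dataset with a smaller row_group_size_mb and pre-shuffling the rows before they are written. The schema and df arguments are placeholders for your Unischema and for a Spark DataFrame already laid out according to it, and 64 MB is an arbitrary illustrative value.

from pyspark.sql.functions import rand
from petastorm.etl.dataset_metadata import materialize_dataset

def write_shuffled(spark, df, schema, output_url):
    # Smaller row groups mean fewer correlated neighboring samples per
    # atomic read unit.
    with materialize_dataset(spark, output_url, schema, row_group_size_mb=64):
        (df.orderBy(rand())              # pre-shuffle rows before writing
           .write.mode('overwrite')
           .parquet(output_url))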

I am a bit confused by the fact that you use make_batch_reader: you mentioned you work with images, but for images I would expect you to use tensors and your own custom Unischema. make_batch_reader, however, is designed to work with standard Apache Parquet types (I think the image fields are currently unusable in this setup; we will fix it soon).
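
For completeness, a hedged sketch of what such a custom Unischema might look like (the field names, image shape, and codec choice are illustrative assumptions, not taken from the thread); a dataset materialized with this schema would then be read with make_reader, which decodes the image field into a numpy array:

import numpy as np
from pyspark.sql.types import StringType
from petastorm.codecs import CompressedImageCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField

# Illustrative schema: an image name plus a fixed-size RGB image stored with
# Petastorm's JPEG codec.
ImageSchema = Unischema('ImageSchema', [
    UnischemaField('imagename', np.bytes_, (), ScalarCodec(StringType()), False),
    UnischemaField('image', np.uint8, (224, 224, 3), CompressedImageCodec('jpeg'), False),
])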

0 reactions
selitvin commented, Jan 15, 2020

If I understand correctly, the 2-pass shuffling mentioned in the article describes a shuffling algorithm that utilizes additional disk storage. A preprocessing step that reshuffles and persists shuffled data is definitely something that can be done by a user, but it is not within the scope of the Petastorm library’s responsibility.
