Regarding performance of petastorm with shuffle
Hi, I have a 24 GB dataset (1.28 million images), stored as 31 Parquet files of roughly 800 MB each. I have a 60-core CPU machine with 120 GB of RAM.
I use the code below:
```python
from petastorm import make_batch_reader

def just_read_batch(dataset_url):
    with make_batch_reader(dataset_url, num_epochs=1, workers_count=16,
                           shuffle_row_drop_partitions=10) as train_reader:
        i = 0
        for schema_view in train_reader:
            # each schema_view is a batch of rows; count them via one column
            i += len(schema_view.imagename)
        print(i)
```
Parquet with shuffle (shuffle_row_drop_partitions = 5): 202 seconds (about 6K images per second); run again with shuffle: 200 seconds; run again: 200 seconds.
With shuffle_row_drop_partitions = 10: 361 seconds; run again: 370 seconds.
How can I improve the performance? I need shuffled data on each epoch. Please check my code snippet and suggest whether I need to change the configuration to get better performance.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Shorter version

- `shuffle_row_drop_partitions`: avoid it. I would expect you to get 30K images per second, unless the network/disk saturates.
- `make_batch_reader` would not decode the images for you at the moment (a bug: you would get unusable binary blobs). You should use `make_reader`, which will handle the decoding correctly. Decoding the images, however, is likely to stress your CPUs, and you may see lower throughput.
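For reference, a minimal sketch of the suggested `make_reader` call, assuming the dataset was generated through Petastorm with a Unischema (`make_reader` needs Petastorm metadata and will not open a plain Parquet store):

```python
from petastorm import make_reader

def just_read_shuffled(dataset_url):
    # shuffle_row_groups=True (the default) randomizes row-group order;
    # shuffle_row_drop_partitions stays at its default of 1, so nothing
    # that is downloaded gets thrown away.
    with make_reader(dataset_url, num_epochs=1, workers_count=16) as reader:
        i = 0
        for sample in reader:  # one decoded row per iteration
            i += 1
        print(i)
```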
Longer version

`shuffle_row_drop_partitions` is a very wasteful parameter: it decreases correlation between produced samples at the expense of network traffic. With `shuffle_row_drop_partitions=10`, only 1/10 of the samples downloaded in a row group are used (the same row group is loaded again later and another 1/10 is used, and so on, so each row group is read ten times in total). I'd suggest using this switch only in extreme cases due to its wastefulness.

Reading from the Apache Parquet format can be done only in atomic units of a row group. `shuffle_row_groups=True` results in row groups being consumed in random order, but there is still an inter-row-group correlation issue. Here are some ideas for reducing correlation between samples (see the sketch after this list):

- Use a shuffling queue: `tf_tensors` takes a `shuffling_queue_capacity` parameter, or you can use `tf.data` facilities. I think you can also use `RandomShufflingBuffer` in `petastorm.reader_impl.shuffling_buffer` - we did not intend it to be a public class, but I think it would work as a pure-Python shuffling queue implementation.
- `materialize_dataset` takes `row_group_size_mb`. Reducing its value will decrease the number of rows in a row group (note that with an off-the-shelf Parquet Java library implementation you cannot get fewer than 100 rows; we have a patch I can share with you that would allow overcoming that issue).
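To illustrate the shuffling-queue idea without relying on the internal `RandomShufflingBuffer` class, here is a hand-rolled, pure-Python equivalent; the `capacity` value is an assumption to tune against your RAM budget:

```python
import random

def shuffling_buffer(samples, capacity=10000):
    """Yield samples in randomized order using a bounded shuffle buffer."""
    buf = []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= capacity:
            # swap a uniformly random element to the end and yield it
            j = random.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]
            yield buf.pop()
    random.shuffle(buf)  # drain the remainder at end of epoch
    yield from buf

# usage with a Petastorm reader (names as in the snippets above):
# with make_reader(dataset_url, num_epochs=1) as reader:
#     for sample in shuffling_buffer(reader, capacity=10000):
#         ...
```

A larger `capacity` decorrelates samples from the same row group more thoroughly, at the cost of `capacity * sample_size` of memory.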
I am a bit confused by the fact that you use `make_batch_reader`: you mentioned you work with images, but for images I would expect you to use a Tensor field and your own custom Unischema. `make_batch_reader`, however, is designed to work with standard Apache Parquet types (I think the image fields are currently unusable in this setup; we will fix it soon).

If I understand correctly, the 2-pass shuffling mentioned in the article describes a shuffling algorithm that utilizes additional disk storage. A preprocessing step that reshuffles and persists shuffled data is definitely something that can be done by a user, but it is not within the scope of the Petastorm library's responsibility…