Petastorm fails with big datasets
Using the code from the repo's GitHub main page as a reference, my code looks as follows:
import logging

from pyspark.sql import SparkSession
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row

spark = SparkSession.builder.config('spark.driver.memory', '10g').master('local[4]').getOrCreate()
sc = spark.sparkContext

with materialize_dataset(spark=spark, dataset_url='file:///opt/data/hello_world_dataset',
                         schema=MySchema, row_group_size_mb=256):
    logging.info('Building RDD...')
    rows_rdd = (sc.parallelize(ids)
                .flatMap(row_generator)  # generator that yields lists of examples
                .map(lambda x: dict_to_spark_row(MySchema, x)))

    logging.info('Creating DataFrame...')
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .coalesce(10) \
        .write \
        .mode('overwrite') \
        .parquet('file:///opt/data/hello_world_dataset')
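(For context: MySchema, row_generator, and ids are not shown in the issue. A minimal hypothetical pair that would fit the snippet above, following Petastorm's documented Unischema pattern, could look like this — the field names and shapes are assumptions, not the reporter's actual schema:)

import numpy as np
from pyspark.sql.types import IntegerType
from petastorm.codecs import NdarrayCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField

MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('features', np.float32, (1024,), NdarrayCodec(), False),
])

def row_generator(example_id):
    # Yields one (or more) example dicts per input id; flatMap flattens the output.
    yield {'id': example_id,
           'features': np.random.rand(1024).astype(np.float32)}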
Now the RDD code executes successfully, but the .createDataFrame call fails with the following error:
_pickle.PicklingError: Could not serialize broadcast: OverflowError: cannot serialize a string larger than 4GiB
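(For what it's worth, the 4 GiB limit comes from pickle itself: protocols 3 and below encode object sizes in 4-byte length fields, while protocol 4 added support for objects larger than 4 GiB. A standalone demonstration, which needs well over 4 GiB of free memory to run:)

import pickle

big = b'x' * (4 * 1024**3 + 1)  # a single object just over 4 GiB

# pickle.dumps(big, protocol=3)  # OverflowError: cannot serialize a bytes object larger than 4 GiB
data = pickle.dumps(big, protocol=4)  # succeeds: protocol 4 supports huge objects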
This is my first experience with Spark, so I can't really tell whether this error originates in Spark or in Petastorm.
Looking through other solutions to this error (with respect to Spark, not Petastorm), I saw that it might have to do with the pickling protocol, but I couldn't confirm that, nor did I find a way of altering the pickling protocol.
How could I avoid this error?
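(For reference, a workaround that has been reported for this exact Spark-side error — it is not Petastorm-specific, and whether it fully resolves this case is an assumption — is to monkey-patch PySpark's Broadcast.dump so broadcasts are pickled with protocol 4 instead of the lower protocol older PySpark versions hard-coded:)

import pickle
from pyspark import broadcast

def _broadcast_dump(self, value, f):
    # Mirrors the stock Broadcast.dump of older PySpark, but forces pickle
    # protocol 4, which supports objects larger than 4 GiB.
    pickle.dump(value, f, 4)
    f.close()
    return f.name

broadcast.Broadcast.dump = _broadcast_dump

(Apply the patch before creating the DataFrame. Newer Spark and Python combinations default to a higher pickle protocol, so upgrading may make the patch unnecessary.)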
Issue Analytics
- State:
- Created 5 years ago
- Comments: 20
Top Results From Across the Web

Creating parquet Petastorm dataset through Spark fails with ...
This is my first experience with Spark, so I can't really tell if this error originates in Spark or Petastorm. Looking through other...

User guide — petastorm 0.12.0 documentation
This library enables single machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm...

How (Not) To Scale Deep Learning in 6 Easy Steps - Databricks
from petastorm import make_batch_reader from petastorm.tf_utils import ... For larger data sets and less complex networks, the I/O overhead may be larger, ...

Creating a Petastorm Dataset from ImageNet
Petastorm is an open source library for large datasets, suited for high throughput I/O ... which would fail when simply calling os.open()...

FAQ — Ray 2.2.0 - the Ray documentation
Amazon is using Ray Datasets for large-scale I/O in their scalable data ... Supported data types: Petastorm only supports Parquet data, while Ray...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Got it. I’ll update the ticket when we are out with the new version that should handle your case well.
(just in case) @miguelalonsojr I'm actually no longer handling this issue (I changed the dataset to a smaller one). I hope the maintainers have resolved it.