Petastorm fails with big datasets
Using the code from the repo's GitHub main page as a reference, my code looks as follows:
import logging

from pyspark.sql import SparkSession
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row

spark = SparkSession.builder.config('spark.driver.memory', '10g').master('local[4]').getOrCreate()
sc = spark.sparkContext

with materialize_dataset(spark=spark, dataset_url='file:///opt/data/hello_world_dataset',
                         schema=MySchema, row_group_size_mb=256):
    logging.info('Building RDD...')
    rows_rdd = (sc.parallelize(ids)
                .flatMap(row_generator)  # generator that yields lists of examples
                .map(lambda x: dict_to_spark_row(MySchema, x)))

    logging.info('Creating DataFrame...')
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .coalesce(10) \
        .write \
        .mode('overwrite') \
        .parquet('file:///opt/data/hello_world_dataset')
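(For context: MySchema, row_generator, and ids are not shown in the issue. A minimal hypothetical pair that would fit the snippet above, following Petastorm's documented Unischema pattern, could look like this — the field names and shapes are assumptions, not the reporter's actual schema:)

import numpy as np
from pyspark.sql.types import IntegerType
from petastorm.codecs import NdarrayCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField

MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('features', np.float32, (1024,), NdarrayCodec(), False),
])

def row_generator(example_id):
    # Yields one (or more) example dicts per input id; flatMap flattens the output.
    yield {'id': example_id,
           'features': np.random.rand(1024).astype(np.float32)}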
Now the RDD code executes successfully, but the .createDataFrame call fails with the following error:
_pickle.PicklingError: Could not serialize broadcast: OverflowError: cannot serialize a string larger than 4GiB
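(For what it's worth, the 4 GiB limit comes from pickle itself: protocols 3 and below encode object sizes in 4-byte length fields, while protocol 4 added support for objects larger than 4 GiB. A standalone demonstration, which needs well over 4 GiB of free memory to run:)

import pickle

big = b'x' * (4 * 1024**3 + 1)  # a single object just over 4 GiB

# pickle.dumps(big, protocol=3)  # OverflowError: cannot serialize a bytes object larger than 4 GiB
data = pickle.dumps(big, protocol=4)  # succeeds: protocol 4 supports huge objects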
This is my first experience with Spark, so I can't really tell whether this error originates in Spark or in Petastorm.
Looking through other solutions to this error (with respect to Spark, not Petastorm), I saw that it might have to do with the pickling protocol, but I couldn't confirm that, nor did I find a way of altering the pickling protocol.
How could I avoid this error?
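(For reference, a workaround that has been reported for this exact Spark-side error — it is not Petastorm-specific, and whether it fully resolves this case is an assumption — is to monkey-patch PySpark's Broadcast.dump so broadcasts are pickled with protocol 4 instead of the lower protocol older PySpark versions hard-coded:)

import pickle
from pyspark import broadcast

def _broadcast_dump(self, value, f):
    # Mirrors the stock Broadcast.dump of older PySpark, but forces pickle
    # protocol 4, which supports objects larger than 4 GiB.
    pickle.dump(value, f, 4)
    f.close()
    return f.name

broadcast.Broadcast.dump = _broadcast_dump

(Apply the patch before creating the DataFrame. Newer Spark and Python combinations default to a higher pickle protocol, so upgrading may make the patch unnecessary.)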
Issue Analytics
- State:
- Created 5 years ago
- Comments: 20
Top Results From Across the Web

Creating parquet Petastorm dataset through Spark fails with ...
This is my first experience with Spark, so I can't really tell if this error originates in Spark or Petastorm. Looking through other...

User guide — petastorm 0.12.0 documentation
This library enables single machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm...

How (Not) To Scale Deep Learning in 6 Easy Steps - Databricks
from petastorm import make_batch_reader from petastorm.tf_utils import ... For larger data sets and less complex networks, the I/O overhead may be larger, ...

Creating a Petastorm Dataset from ImageNet
Petastorm is an open source library for large datasets, suited for high throughput I/O ... which would fail when simply calling os.open()...

FAQ — Ray 2.2.0 - the Ray documentation
Amazon is using Ray Datasets for large-scale I/O in their scalable data ... Supported data types: Petastorm only supports Parquet data, while Ray...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Got it. I’ll update the ticket when we are out with the new version that should handle your case well.
(just in case) @miguelalonsojr I'm actually no longer handling this issue (I changed the dataset to a smaller one). I hope the maintainers have resolved it.