
Petastorm fails with big datasets

See original GitHub issue

Using the code from the repo's GitHub main page as a reference, my code looks as follows:

import logging

from pyspark.sql import SparkSession
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import dict_to_spark_row

spark = SparkSession.builder.config('spark.driver.memory', '10g').master('local[4]').getOrCreate()
sc = spark.sparkContext

with materialize_dataset(spark=spark, dataset_url='file:///opt/data/hello_world_dataset',
                         schema=MySchema, row_group_size_mb=256):

    logging.info('Building RDD...')
    # row_generator yields a list of examples for each id
    rows_rdd = sc.parallelize(ids) \
        .flatMap(row_generator) \
        .map(lambda x: dict_to_spark_row(MySchema, x))

    logging.info('Creating DataFrame...')
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .coalesce(10) \
        .write \
        .mode('overwrite') \
        .parquet('file:///opt/data/hello_world_dataset') 

Now the RDD code executes successfully, but it fails at the .createDataFrame call with the following error:

_pickle.PicklingError: Could not serialize broadcast: OverflowError: cannot serialize a string larger than 4GiB

This is my first experience with Spark, so I can’t really tell if this error originates in Spark or Petastorm.

Looking through other solutions to this error (with respect to Spark, not Petastorm), I saw that it might have to do with the pickling protocol, but I couldn't confirm that, nor did I find a way to change the pickling protocol.
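That hunch is right as far as the limit goes: the 4 GiB ceiling is a property of pickle protocol 2, which older Spark versions used when pickling broadcast variables; protocol 4 (Python 3.4+) uses 64-bit framing and has no such cap. A commonly cited workaround — a sketch, not an official API, and it assumes a Spark 2.x `Broadcast.dump` that pickles with protocol 2 — is to monkey-patch broadcasts to use protocol 4 before building the DataFrame:

```python
import pickle

# Protocol 2 stores string/bytes lengths in 32 bits, hence the 4 GiB cap;
# protocol 4 (Python 3.4+) uses 64-bit framing and lifts that limit.
print(pickle.dumps(b"x", protocol=2)[:2])  # b'\x80\x02' -- PROTO opcode, protocol 2
print(pickle.dumps(b"x", protocol=4)[:2])  # b'\x80\x04' -- PROTO opcode, protocol 4

# Monkey-patch (assumes Spark 2.x, where Broadcast.dump hardcodes
# protocol 2): make broadcast variables pickle with protocol 4 instead.
try:
    from pyspark import broadcast

    def _dump_protocol_4(self, value, f):
        pickle.dump(value, f, protocol=4)
        f.close()

    broadcast.Broadcast.dump = _dump_protocol_4
except ImportError:
    pass  # pyspark not on the path; the patch is illustrative only
```

Apply the patch in the driver before the `createDataFrame` call; newer Spark releases raised the protocol themselves, so check your version first.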

How could I avoid this error?
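A different mitigation sketch, assuming the payload that trips the limit is the large local `ids` collection that `sc.parallelize` ships to executors (the batch size and the `write_in_batches` helper are my own; `row_generator`, `MySchema`, and `dict_to_spark_row` are reused from the snippet above): write the dataset in id batches, appending each Parquet chunk so no single job serializes more than one batch. Whether batched writes interact cleanly with `materialize_dataset`'s metadata step is worth verifying.

```python
def chunks(seq, size):
    # Yield successive slices of `seq`, each with at most `size` elements.
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def write_in_batches(spark, ids, batch_size=100_000):
    # Hypothetical driver loop: each batch keeps the serialized payload
    # Spark ships per job well under pickle protocol 2's 4 GiB ceiling.
    # row_generator, MySchema, dict_to_spark_row come from the snippet above.
    sc = spark.sparkContext
    for i, id_batch in enumerate(chunks(ids, batch_size)):
        rows_rdd = (sc.parallelize(id_batch)
                      .flatMap(row_generator)
                      .map(lambda x: dict_to_spark_row(MySchema, x)))
        (spark.createDataFrame(rows_rdd, MySchema.as_spark_schema())
              .coalesce(10)
              .write
              .mode('overwrite' if i == 0 else 'append')
              .parquet('file:///opt/data/hello_world_dataset'))
```

The first batch overwrites any stale output; subsequent batches append to the same Parquet directory.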

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 20

Top GitHub Comments

1 reaction
selitvin commented, Nov 22, 2018

Got it. I’ll update the ticket when we are out with the new version that should handle your case well.

0 reactions
YunseokJANG commented, Feb 9, 2021

(Just in case) @miguelalonsojr, I'm actually no longer working on this issue (I switched to a smaller dataset). I hope the maintainers have resolved it.


