
How to change compression when writing parquet files using pyspark

See original GitHub issue

I found in the docs that I can set the write property "write.parquet.compression-codec" to change the Parquet compression codec when writing with Spark, but that didn't work. I also read a Stack Overflow answer suggesting write.option("compression", "uncompressed"), but that didn't help either.

Here are the ways I have tried:

from pyspark.sql import SparkSession

resource_path = "/home/xxx"

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName("iceberg_test") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.local.type", "hadoop") \
        .config("spark.sql.catalog.local.warehouse", "$PWD/warehouse") \
        .config("spark.sql.parquet.compression.codec", "uncompressed") \
        .config("write.parquet.compression-codec", "uncompressed") \
        .getOrCreate()

    # write method 1: spark.sql -- doesn't work
    spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
    spark.sql("CREATE TABLE local.db.test1 (num int, character string) USING iceberg")
    spark.sql("INSERT INTO local.db.test1 VALUES (1, 'a'), (2, 'b')")

    # write method 2: v1 DataFrame API -- doesn't work
    spark.sql("CREATE TABLE local.db.test2 (path String, modificationTime timestamp, length string, content binary) USING iceberg")
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(resource_path)
    df.write.option("compression", "uncompressed").format("iceberg").mode("overwrite").saveAsTable("local.db.test2")

    # write method 3: v2 DataFrame API -- doesn't work
    spark.sql("CREATE TABLE local.db.test3 (path String, modificationTime timestamp, length string, content binary) USING iceberg")
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(resource_path)
    df.writeTo("local.db.test3").option("compression", "uncompressed").append()

Can anyone help me?

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
RussellSpitzer commented, May 31, 2022

So did you try using the table property write.parquet.compression-codec? This is a table property you need to set. It looks like you tried using only the Spark property and never the Iceberg one.

On write I believe you can pass through the same thing as a write option … but I haven’t checked, we don’t have it doc’d.
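For reference, a minimal sketch of setting that table property through Spark SQL, reusing the session and local catalog from the question (the table name and the uncompressed codec are just placeholders):

# Set the Iceberg table property at creation time...
spark.sql("""
    CREATE TABLE local.db.test1 (num int, character string)
    USING iceberg
    TBLPROPERTIES ('write.parquet.compression-codec' = 'uncompressed')
""")

# ...so subsequent inserts write uncompressed Parquet files.
spark.sql("INSERT INTO local.db.test1 VALUES (1, 'a'), (2, 'b')")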

1 reaction
kbendick commented, May 31, 2022

Yes, the table property write.parquet.compression-codec is best. It looks like write-format can be set as an option for individual writes, but for Iceberg, the table-level property write.parquet.compression-codec is what you want.

You can update it later if you want to try something new, and then future writes to that Iceberg table will use that codec for parquet files. Generally the codec isn’t changed very often.
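A sketch of updating the codec on an existing table, assuming the same catalog setup as the question ('snappy' is just an illustrative value):

# Change the codec on an existing Iceberg table; later writes use the new value.
spark.sql("""
    ALTER TABLE local.db.test2
    SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy')
""")

# Data written after the change is compressed with the updated codec;
# files written earlier are left as they were.
df.writeTo("local.db.test2").append()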
