
How to change compression when writing parquet files using pyspark

See original GitHub issue

I found in the docs that I can set the write property "write.parquet.compression-codec" to change the Parquet compression codec when writing with Spark, but that didn't work. I also read a Stack Overflow answer suggesting write.option("compression", "uncompressed"), but that didn't help either.

Here are the ways I have tried:

from pyspark.sql import SparkSession

resource_path = "/home/xxx"

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName("iceberg_test") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.local.type", "hadoop") \
        .config("spark.sql.catalog.local.warehouse", "$PWD/warehouse") \
        .config("spark.sql.parquet.compression.codec", "uncompressed") \
        .config("write.parquet.compression-codec", "uncompressed") \
        .getOrCreate()

    # write method 1: spark.sql -- doesn't work
    spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
    spark.sql("CREATE TABLE local.db.test1 (num int, character string) USING iceberg")
    spark.sql("INSERT INTO local.db.test1 VALUES (1, 'a'), (2, 'b')")

    # write method 2: v1 DataFrame API -- doesn't work
    spark.sql("CREATE TABLE local.db.test2 (path String, modificationTime timestamp, length string, content binary) USING iceberg")
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(resource_path)
    df.write.option("compression", "uncompressed").format("iceberg").mode("overwrite").saveAsTable("local.db.test2")

    # write method 3: v2 DataFrame API -- doesn't work
    spark.sql("CREATE TABLE local.db.test3 (path String, modificationTime timestamp, length string, content binary) USING iceberg")
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(resource_path)
    df.writeTo("local.db.test3").option("compression", "uncompressed").append()

Can anyone help me?

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
RussellSpitzer commented, May 31, 2022

So did you try using the table property write.parquet.compression-codec? This is a table property you need to set. It looks like you tried using only the Spark property and never the Iceberg one.

On write I believe you can pass through the same thing as a write option … but I haven’t checked, we don’t have it doc’d.
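For reference, a minimal sketch of setting that table property through Spark SQL, reusing the session and local catalog from the question (the table name and the uncompressed codec are just placeholders):

# Set the Iceberg table property at creation time...
spark.sql("""
    CREATE TABLE local.db.test1 (num int, character string)
    USING iceberg
    TBLPROPERTIES ('write.parquet.compression-codec' = 'uncompressed')
""")

# ...so subsequent inserts write uncompressed Parquet files.
spark.sql("INSERT INTO local.db.test1 VALUES (1, 'a'), (2, 'b')")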

1 reaction
kbendick commented, May 31, 2022

Yes, the table property write.parquet.compression-codec is best. It looks like write-format can be set as an option for individual writes, but for Iceberg, the table-level property write.parquet.compression-codec is what you want.

You can update it later if you want to try something new, and then future writes to that Iceberg table will use that codec for parquet files. Generally the codec isn’t changed very often.
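A sketch of updating the codec on an existing table, assuming the same catalog setup as the question ('snappy' is just an illustrative value):

# Change the codec on an existing Iceberg table; later writes use the new value.
spark.sql("""
    ALTER TABLE local.db.test2
    SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy')
""")

# Data written after the change is compressed with the updated codec;
# files written earlier are left as they were.
df.writeTo("local.db.test2").append()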
