How to change compression when writing Parquet files using PySpark
I found in the docs that I can set the write property "write.parquet.compression-codec" to change the Parquet compression codec when writing with Spark, but that didn't work. I also read a Stack Overflow answer suggesting write.option("compression", "uncompressed"); sadly, that didn't help either.

Here are the approaches I have tried:
from pyspark.sql import SparkSession

resource_path = "/home/xxx"

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName("iceberg_test") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.1") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog") \
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.local.type", "hadoop") \
        .config("spark.sql.catalog.local.warehouse", "$PWD/warehouse") \
        .config("spark.sql.parquet.compression.codec", "uncompressed") \
        .config("write.parquet.compression-codec", "uncompressed") \
        .getOrCreate()

    # write method 1: spark.sql -- doesn't work
    spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
    spark.sql("CREATE TABLE local.db.test1 (num int, character string) USING iceberg")
    spark.sql("INSERT INTO local.db.test1 VALUES (1, 'a'), (2, 'b')")

    # write method 2: v1 DataFrame API -- doesn't work
    spark.sql("CREATE TABLE local.db.test2 (path string, modificationTime timestamp, length string, content binary) USING iceberg")
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(resource_path)
    df.write.option("compression", "uncompressed").format("iceberg").mode("overwrite").saveAsTable("local.db.test2")

    # write method 3: v2 DataFrame API -- doesn't work
    spark.sql("CREATE TABLE local.db.test3 (path string, modificationTime timestamp, length string, content binary) USING iceberg")
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(resource_path)
    df.writeTo("local.db.test3").option("compression", "uncompressed").append()
Can anyone help me?
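One way to check which codec a write actually used is to inspect the footer metadata of a produced data file with PyArrow. A minimal sketch, assuming PyArrow is installed; the file path below is a placeholder for one of the Parquet files Iceberg wrote under the warehouse directory:

import pyarrow.parquet as pq

# Placeholder path: point this at an actual data file under your warehouse
path = "warehouse/db/test1/data/part-00000.parquet"

# Compression is recorded per column chunk in the Parquet footer
metadata = pq.ParquetFile(path).metadata
print(metadata.row_group(0).column(0).compression)  # e.g. 'GZIP', 'SNAPPY', 'UNCOMPRESSED'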
Issue Analytics

- Created: a year ago
- Comments: 5 (1 by maintainers)
Top Results From Across the Web

Parquet Files - Spark 2.4.3 Documentation
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing...

How can I change the parquet compression algorithm from ...
Spark, by default, uses gzip to store parquet files. I would like to change the compression algorithm from gzip to snappy or lz4....

Spark not using spark.sql.parquet.compression.codec
Worked for me in 2.1.1: df.write.option("compression","snappy").parquet(filename).

PySpark Read and Write Parquet File - Spark by {Examples}
Since we don't have the parquet file, let's work with writing parquet from a DataFrame. First, create a PySpark DataFrame from a list...

Writing Dataframe - Pyspark tutorials - WordPress.com
when writing the parquet format to hdfs, we can make use of dataframe write operation to write the parquet, but when we...
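Note that these results cover plain (non-Iceberg) Parquet output, where the standard Spark settings do apply. A minimal sketch of both the session-level and per-write knobs, assuming a local session and a writable /tmp path (both placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("codec_demo").getOrCreate()

# Session-wide default codec for plain Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["num", "character"])

# Per-write option; overrides the session-level setting for this write
df.write.option("compression", "snappy").parquet("/tmp/codec_demo")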
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
So did you try using the table property write.parquet.compression-codec? This is a table property you need to set; it looks like you tried only the Spark property and never the Iceberg one. On write, I believe you can pass the same thing through as a write option, but I haven't checked; we don't have it documented.
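For reference, a hedged sketch of what setting that property at table creation could look like from PySpark (the table name is a placeholder, not from the issue):

spark.sql("""
    CREATE TABLE local.db.test_uncompressed (num int, character string)
    USING iceberg
    TBLPROPERTIES ('write.parquet.compression-codec' = 'uncompressed')
""")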
Yes, the table property write.parquet.compression-codec is best. It looks like write-format can be set as an option for individual writes, but for Iceberg the table-level property write.parquet.compression-codec is what you want. You can update it later if you want to try something new, and then future writes to that Iceberg table will use that codec for Parquet files. Generally the codec isn't changed very often.
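Changing the property on an existing table is a one-line ALTER; a short sketch, again with a placeholder table name:

spark.sql("""
    ALTER TABLE local.db.test_uncompressed
    SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy')
""")

Files already written keep their old codec; only files produced by subsequent writes pick up the new one.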