DataFrameWriterV2 doesn't create partitions when using partitionedBy
When I try using DataFrameWriterV2 to create an Iceberg table, the partitioning columns seem to be ignored (the table ends up unpartitioned) even though they are specified.
Spark version: 3.1.2
Iceberg version: 0.13.0
Note: The application is written in python.
orders_df.writeTo("iceberg_catalog.p2.orders").using("iceberg").option(
    "fanout-enabled", "true"
).partitionedBy("date(pickupts)", "appname").create()
Describing the table with Spark SQL afterwards shows this:
# Partitioning
Not partitioned
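For reference, the output above comes from describing the table in Spark SQL; a query along these lines (table name taken from the snippet above) prints the partition spec:

```sql
-- Inspect the table; the partition spec appears under "# Partitioning"
DESCRIBE TABLE iceberg_catalog.p2.orders;
```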
The workaround for creating the table with the appropriate partitions is to go through Spark SQL, i.e. the following works as expected:
orders_df.createOrReplaceTempView("p2_orders_view")
spark.sql(
    """
    create table iceberg_catalog.p2.orders using iceberg
    PARTITIONED BY (date(pickupts), appname)
    as select * from p2_orders_view
    """
)
Describing the table created this way shows this:
# Partitioning
Part 0 days(pickupts)
Part 1 appname
My question is, am I using the DataFrameWriterV2 correctly when specifying the partitionedBy columns?
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 9 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@krisvaz @hililiwei
In PySpark 3.1.2, DataFrameWriterV2.partitionedBy never forwards the columns to the JVM-side writer: https://github.com/apache/spark/blob/v3.1.2/python/pyspark/sql/readwriter.py#L1496-L1525 is missing the call to "self._jwriter.partitionedBy(col, cols)".
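To make that diagnosis concrete, here is a minimal self-contained model of the bug (stub classes standing in for PySpark and the py4j JVM handle, no real Spark involved): the 3.1.2-style partitionedBy silently drops its arguments, while the fixed version forwards them to the underlying writer.

```python
class FakeJvmWriter:
    """Stands in for the py4j handle to the Scala-side DataFrameWriterV2."""
    def __init__(self):
        self.partitioning = []

    def partitionedBy(self, col, cols):
        # Records the partition transforms, as the real JVM writer would.
        self.partitioning = [col, *cols]


class BuggyWriter:
    """Mimics PySpark 3.1.2: partitionedBy is effectively a no-op."""
    def __init__(self):
        self._jwriter = FakeJvmWriter()

    def partitionedBy(self, col, *cols):
        # Missing: self._jwriter.partitionedBy(col, cols)
        return self


class FixedWriter(BuggyWriter):
    """Mimics the fix: forward the transforms to the JVM writer."""
    def partitionedBy(self, col, *cols):
        self._jwriter.partitionedBy(col, list(cols))
        return self


buggy = BuggyWriter().partitionedBy("days(pickupts)", "appname")
fixed = FixedWriter().partitionedBy("days(pickupts)", "appname")
print(buggy._jwriter.partitioning)  # [] -> table is created "Not partitioned"
print(fixed._jwriter.partitioning)  # ['days(pickupts)', 'appname']
```

This matches the symptom in the report: the Python call chain succeeds, but because nothing reaches the JVM writer, the table is created without a partition spec.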
This issue has been closed because it has not received any activity in the last 14 days since being marked as "stale".