DataFrameWriterV2 doesn't create partitions when using partitionedBy
When I try using DataFrameWriterV2 to create an Iceberg table, the partitioning columns seem to be ignored (the table ends up unpartitioned) even though they are specified.
Spark version: 3.1.2
Iceberg version: 0.13.0
Note: The application is written in python.
orders_df.writeTo("iceberg_catalog.p2.orders").using("iceberg").option(
    "fanout-enabled", "true"
).partitionedBy("date(pickupts)", "appname").create()
Describing the table with Spark SQL afterwards shows this:
# Partitioning
Not partitioned
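For reference, the output above comes from describing the table in Spark SQL; a query along these lines (table name taken from the snippet above) prints the partition spec:

```sql
-- Inspect the table; the partition spec appears under "# Partitioning"
DESCRIBE TABLE iceberg_catalog.p2.orders;
```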
The workaround for creating the table with the appropriate partitions is to go through Spark SQL, i.e. the following works as expected:
orders_df.createOrReplaceTempView("p2_orders_view")
spark.sql(
    """
    create table iceberg_catalog.p2.orders using iceberg
    PARTITIONED BY (date(pickupts), appname)
    as select * from p2_orders_view
    """
)
Describing the table created this way shows this:
# Partitioning
Part 0 days(pickupts)
Part 1 appname
My question is, am I using the DataFrameWriterV2 correctly when specifying the partitionedBy columns?
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 9 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@krisvaz @hililiwei
In PySpark 3.1.2, DataFrameWriterV2.partitionedBy never forwards the columns to the JVM-side writer: https://github.com/apache/spark/blob/v3.1.2/python/pyspark/sql/readwriter.py#L1496-L1525 is missing the call to "self._jwriter.partitionedBy(col, cols)".
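To make that diagnosis concrete, here is a minimal self-contained model of the bug (stub classes standing in for PySpark and the py4j JVM handle, no real Spark involved): the 3.1.2-style partitionedBy silently drops its arguments, while the fixed version forwards them to the underlying writer.

```python
class FakeJvmWriter:
    """Stands in for the py4j handle to the Scala-side DataFrameWriterV2."""
    def __init__(self):
        self.partitioning = []

    def partitionedBy(self, col, cols):
        # Records the partition transforms, as the real JVM writer would.
        self.partitioning = [col, *cols]


class BuggyWriter:
    """Mimics PySpark 3.1.2: partitionedBy is effectively a no-op."""
    def __init__(self):
        self._jwriter = FakeJvmWriter()

    def partitionedBy(self, col, *cols):
        # Missing: self._jwriter.partitionedBy(col, cols)
        return self


class FixedWriter(BuggyWriter):
    """Mimics the fix: forward the transforms to the JVM writer."""
    def partitionedBy(self, col, *cols):
        self._jwriter.partitionedBy(col, list(cols))
        return self


buggy = BuggyWriter().partitionedBy("days(pickupts)", "appname")
fixed = FixedWriter().partitionedBy("days(pickupts)", "appname")
print(buggy._jwriter.partitioning)  # [] -> table is created "Not partitioned"
print(fixed._jwriter.partitioning)  # ['days(pickupts)', 'appname']
```

This matches the symptom in the report: the Python call chain succeeds, but because nothing reaches the JVM writer, the table is created without a partition spec.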
This issue has been closed because it has not received any activity in the last 14 days since being marked as "stale".