Issues with non-deterministic behavior when writing to Delta format
When doing a regular partition overwrite over large datasets, we are encountering non-deterministic behavior from Delta.
We have a fairly complex ETL job which extracts, transforms and then loads data from Snowflake to an Azure storage container in Delta format. When executing the job multiple times, we get different results, even though we are loading the same data from Snowflake. We get strange duplicated values which are not supposed to occur:
| version | number of duplicated records |
|---|---|
| 297 | 18663736 |
| 298 | 1491160 |
| 299 | 7369654 |
| 300 | 36846189 |
| 301 | 8647811 |
| 302 | 0 |
At each iteration we execute the same code and load the same data, yet we get different results and a varying number of duplicated records (desired number is 0).
After each transformation we call .cache().count() on the resulting DataFrame, and no duplicates appear at those stages. It is only after the partition overwrite that the duplicated records materialize.
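For readers, a minimal sketch of how such a per-stage duplicate check can be expressed; the `key_columns` argument is illustrative and not part of the original job:

```python
from pyspark.sql import DataFrame
import pyspark.sql.functions as F


def count_duplicate_records(df: DataFrame, key_columns: list) -> int:
    """Hypothetical helper: count the extra records that share the same key values."""
    return (
        df.groupBy(*key_columns)
        .count()
        .filter(F.col("count") > 1)
        .agg(F.coalesce(F.sum(F.col("count") - 1), F.lit(0)).alias("dups"))
        .collect()[0]["dups"]
    )
```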
The way we overwrite the partitions is as follows:
```python
def overwrite_delta(updates_df, *, delta_target=None, partition_fields=None, **kwargs):
    """Overwrite Delta table partitions with new data"""
    partition_to_overwrite = get_partition_values(
        updates_df=updates_df, partition_fields=partition_fields, prefix=""
    )
    print(f"Overwriting partition: {partition_to_overwrite}")
    (
        updates_df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", partition_to_overwrite)
        .partitionBy(*partition_fields)
        .save(delta_target)
    )
```
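The `get_partition_values` helper is internal and not shown in the issue. For context, a hypothetical sketch of what such a helper might look like, building a `replaceWhere` predicate from the distinct partition values present in the update; this is not the reporter's actual code:

```python
def get_partition_values(*, updates_df, partition_fields, prefix=""):
    """Hypothetical sketch: build a replaceWhere predicate such as
    "date IN ('2022-01-01', '2022-01-02')" from the partitions present
    in updates_df."""
    predicates = []
    for field in partition_fields:
        values = [
            row[field] for row in updates_df.select(field).distinct().collect()
        ]
        quoted = ", ".join(f"'{value}'" for value in values)
        predicates.append(f"{prefix}{field} IN ({quoted})")
    return " AND ".join(predicates)
```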
I have not yet found a way to reproduce the issue with a small DataFrame and simple transformations. It only seems to happen when we load a very large DataFrame, perform multiple Spark transformations on it, and then write it in Delta format to a specific location.
Environment information
- Delta Lake version: 1.0.0
- Spark version: 3.1.2
Running on Databricks Runtime 9.1 LTS.
I think this requires a deeper inspection of the code – the issue may be very subtle. I expect that it’s probably not OK to share that code in this public channel. At this point we should probably take this to a Databricks support ticket for further digging.
Tests confirm that the repartitioning command in one of our transformations is causing the duplicates. Using repartition(n) without providing key fields to repartition the files by is discouraged.
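For illustration, a short sketch of the safer alternative: repartitioning by key columns (hash or range) so that a retried task reproduces the same row-to-partition assignment. The column name and partition count are placeholders, not the actual job's values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder DataFrame; in the real job this is the large transformed dataset.
df = spark.range(1_000_000).withColumnRenamed("id", "customer_id")

# Round-robin repartition: the row-to-partition assignment can differ if a
# task is retried, which makes the written files non-deterministic.
unstable = df.repartition(200)

# Deterministic alternatives: partition by a key column so a retried task
# produces the same assignment ("customer_id" is a placeholder name).
by_hash = df.repartition(200, "customer_id")
by_range = df.repartitionByRange(200, "customer_id")
```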
It’s not a problem when the operation is executed once, but it is when it’s executed multiple times. Some tasks are probably failing (e.g. because they run on spot instances or the execution plan runs out of memory), and when a task is restarted the files are shuffled in a different order, causing duplicates.
It could also be that the plan execution runs out of RAM and spills to disk; that alone may change the order in which the results are sent.
Thank you, @bart-samwel, for all the help 😃