
Issues with non-deterministic behavior when writing to Delta format

See original GitHub issue

When doing a regular partition overwrite over large datasets, we are encountering non-deterministic behavior from Delta.

We have a fairly complex ETL job which extracts, transforms and then loads data from Snowflake to an Azure storage container in Delta format. When executing the job multiple times, we get different results, even though we are loading the same data from Snowflake. We get strange duplicated values which are not supposed to occur:

version    duplicated records
297        18663736
298        1491160
299        7369654
300        36846189
301        8647811
302        0

At each iteration we execute the same code and load the same data, yet we get different results and a varying number of duplicated records (desired number is 0).

After each transformation we do .cache().count() of the resulting data frame and no duplicates appear at those stages. It is only after we do the partition overwrite that the duplicated records materialize.
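A minimal sketch of the kind of per-stage duplicate check described above (the helper name and the key_cols argument are illustrative, not taken from the original job):

from pyspark.sql import functions as F

def count_duplicate_keys(df, key_cols):
    """Return the number of key combinations that occur more than once."""
    return (
        df.groupBy(*key_cols)
        .agg(F.count("*").alias("cnt"))
        .filter(F.col("cnt") > 1)
        .count()
    )

# After each transformation stage:
# stage_df = stage_df.cache()
# print(stage_df.count(), count_duplicate_keys(stage_df, ["id", "load_date"]))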

The way we overwrite the partitions is as follows:

def overwrite_delta(updates_df, *, delta_target=None, partition_fields=None, **kwargs):
    """Overwrite Delta table partitions with new data"""
    # Build the replaceWhere predicate from the partition values present in updates_df
    partition_to_overwrite = get_partition_values(
        updates_df=updates_df, partition_fields=partition_fields, prefix=""
    )
    print(f"Overwriting partition: {partition_to_overwrite}")
    # Overwrite only the partitions matching the predicate, not the whole table
    (
        updates_df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", partition_to_overwrite)
        .partitionBy(*partition_fields)
        .save(delta_target)
    )
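For context, a hypothetical invocation of this helper. get_partition_values is an internal function whose output is not shown here, but for a single date partition the replaceWhere predicate it builds would presumably look like the comment below (path, column name and value are illustrative only):

# Illustrative only - the real target path, partition column and value differ.
overwrite_delta(
    updates_df,
    delta_target="abfss://<container>@<account>.dfs.core.windows.net/delta/my_table",
    partition_fields=["load_date"],
)
# get_partition_values(...) would then return a predicate such as:
#   "load_date = '2022-06-01'"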

I have not yet found a way to reproduce the issue with a small DataFrame and simple transformations. It seems to happen only when we load a very large DataFrame, perform multiple Spark transformations on top of it, and then write it in Delta format to a specific location.

Environment information

  • Delta Lake version: 1.0.0
  • Spark version: 3.1.2

Running on Databricks Runtime 9.1 LTS.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:12 (5 by maintainers)

Top GitHub Comments

1 reaction
bart-samwel commented, Jun 21, 2022

I think this requires a deeper inspection of the code – the issue may be very subtle. I expect that it’s probably not OK to share that code in this public channel. At this point we should probably take this to a Databricks support ticket for further digging.

0 reactions
aleksandraangelova commented, Jul 1, 2022

Tests confirm that the repartitioning command in one of our transformations is causing the duplicates. Using repartition(n) without providing key fields to repartition by is discouraged.

It’s not a problem when the operation is executed once, but it becomes a problem when it is executed multiple times. Some tasks are probably failing - e.g. when they run on spot instances or the execution runs out of memory - and when a failed task is restarted, the files are shuffled in a different order, causing duplicates.

It could also be that the plan execution runs out of memory and spills to disk - this by itself may change the order in which the results are sent.
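A minimal sketch of the kind of change this points to (the column names are placeholders; the idea is to hash-partition by explicit key fields instead of a bare repartition(n), so each row's target partition is the same on every task attempt):

from pyspark.sql import functions as F

# Before (round-robin distribution; a re-executed task can place rows differently):
# transformed_df = transformed_df.repartition(200)

# After (deterministic assignment: each row is hash-partitioned by its key):
transformed_df = transformed_df.repartition(200, F.col("customer_id"), F.col("load_date"))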

Thank you, @bart-samwel, for all the help 😃


Top Results From Across the Web

Apache Spark Jobs hang due to non-deterministic custom UDF
Cause. Sometimes a deterministic UDF can behave nondeterministically, performing duplicate invocations depending on the definition of the UDF.

Table deletes, updates, and merges - Delta Lake Documentation
Learn how to delete data from and update data in Delta tables.

Isolation levels and write conflicts on Databricks
Common causes are ALTER TABLE operations or writes to your Delta table that update the schema of the table.

MERGE - Snowflake Documentation
When a merge joins a row in the target table against multiple rows in the source, the following join conditions produce nondeterministic results...

Trouble when writing the data to Delta Lake in Azure ...
The file format should be specified along the supported formats: csv, txt, json, parquet or avro. dataframe = spark.read.format('csv').load(path).
